package parquet

Import Path
	github.com/parquet-go/parquet-go (on go.dev)

Dependency Relation
	imports 56 packages, and imported by 13 packages


Code Examples

	// parquet-go uses the same struct-tag definition style as JSON and XML
	type Contact struct {
		Name string `parquet:"name"`
		// "zstd" specifies the compression for this column
		PhoneNumber string `parquet:"phoneNumber,optional,zstd"`
	}

	type AddressBook struct {
		Owner             string    `parquet:"owner,zstd"`
		OwnerPhoneNumbers []string  `parquet:"ownerPhoneNumbers,gzip"`
		Contacts          []Contact `parquet:"contacts"`
	}

	f, _ := ioutil.TempFile("", "parquet-example-")
	writer := parquet.NewWriter(f)
	rows := []AddressBook{
		{Owner: "UserA", Contacts: []Contact{
			{Name: "Alice", PhoneNumber: "+15505551234"},
			{Name: "Bob"},
		}},
	}
	for _, row := range rows {
		if err := writer.Write(row); err != nil {
			log.Fatal(err)
		}
	}
	_ = writer.Close()
	_ = f.Close()

	rf, _ := os.Open(f.Name())
	pf := parquet.NewReader(rf)
	addrs := make([]AddressBook, 0)
	for {
		var addr AddressBook
		err := pf.Read(&addr)
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		addrs = append(addrs, addr)
	}
	fmt.Println(addrs[0].Owner)
Package-Level Type Names (total 90)
BloomFilter is an interface allowing applications to test whether a key exists in a bloom filter. Tests whether the given value is present in the filter. A non-nil error may be returned if reading the filter failed. This may happen if the filter was lazily loaded from a storage medium during the call to Check for example. Applications that can guarantee that the filter was in memory at the time Check was called can safely ignore the error, which would always be nil in this case. ( BloomFilter) ReadAt(p []byte, off int64) (n int, err error) Returns the size of the bloom filter (in bytes). BloomFilter : io.ReaderAt func ColumnBuffer.BloomFilter() BloomFilter func ColumnChunk.BloomFilter() BloomFilter func github.com/polarsignals/frostdb/dynparquet.(*NilColumnChunk).BloomFilter() BloomFilter
The BloomFilterColumn interface is a declarative representation of bloom filters used when configuring filters on a parquet writer. Returns an encoding which can be used to write columns of values to the filter. Returns the hashing algorithm used when inserting values into a bloom filter. Returns the path of the column that the filter applies to. Returns the size of the filter needed to encode values in the filter, assuming each value will be encoded with the given number of bits. func SplitBlockFilter(bitsPerValue uint, path ...string) BloomFilterColumn func BloomFilters(filters ...BloomFilterColumn) WriterOption
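For illustration, a minimal sketch of declaring a split-block bloom filter on one column when constructing a writer; the row type, field names, and bit count are hypothetical:

	type Person struct {
		FirstName string `parquet:"first_name"`
		LastName  string `parquet:"last_name"`
	}

	output := new(bytes.Buffer)
	// Declare a split-block filter on the "last_name" column using 10 bits per
	// value, and install it with the BloomFilters writer option.
	writer := parquet.NewGenericWriter[Person](output,
		parquet.BloomFilters(
			parquet.SplitBlockFilter(10, "last_name"),
		),
	)
	_ = writer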
BooleanReader is an interface implemented by ValueReader instances which expose the content of a column of boolean values. Read boolean values into the buffer passed as argument. The method returns io.EOF when all values have been read.
BooleanWriter is an interface implemented by ValueWriter instances which support writing columns of boolean values. Write boolean values. The method returns the number of values written, and any error that occurred while writing the values.
Buffer represents an in-memory group of parquet rows. The main purpose of the Buffer type is to provide a way to sort rows before writing them to a parquet file. Buffer implements sort.Interface as a way to support reordering the rows that have been written to it. ColumnBuffers returns the buffer columns. This method is similar to ColumnChunks, but returns a list of ColumnBuffer instead of ColumnChunk values (the latter being read-only); calling ColumnBuffers or ColumnChunks with the same index returns the same underlying objects, but with different types, which removes the need for making a type assertion if the program needed to write directly to the column buffers. The presence of the ColumnChunks method is still required to satisfy the RowGroup interface. ColumnChunks returns the buffer columns. Len returns the number of rows written to the buffer. Less returns true if row[i] < row[j] in the buffer. NumRows returns the number of rows written to the buffer. Reset clears the content of the buffer, allowing it to be reused. Rows returns a reader exposing the current content of the buffer. The buffer and the returned reader share memory. Mutating the buffer concurrently to reading rows may result in non-deterministic behavior. Schema returns the schema of the buffer. The schema is either configured by passing a Schema in the option list when constructing the buffer, or lazily discovered when the first row is written. Size returns the estimated size of the buffer in memory (in bytes). SortingColumns returns the list of columns by which the buffer will be sorted. The sorting order is configured by passing a SortingColumns option when constructing the buffer. Swap exchanges the rows at indexes i and j. Write writes a row held in a Go value to the buffer. WriteRowGroup satisfies the RowGroupWriter interface. WriteRows writes parquet rows to the buffer. *Buffer : RowGroup *Buffer : RowGroupWriter *Buffer : RowWriter *Buffer : RowWriterWithSchema *Buffer : github.com/polarsignals/frostdb/query/expr.Particulate *Buffer : sort.Interface func NewBuffer(options ...RowGroupOption) *Buffer
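As a hedged sketch of the intended workflow (row type and values are hypothetical), rows are accumulated in a Buffer, sorted in place, then flushed to a writer as a row group; the ordering applied by sort.Sort depends on the sorting columns configured when the buffer was constructed:

	type Event struct {
		Timestamp int64  `parquet:"timestamp"`
		Message   string `parquet:"message"`
	}

	buffer := parquet.NewBuffer() // sorting columns may be configured here via RowGroupOption
	for _, event := range []Event{{Timestamp: 2, Message: "b"}, {Timestamp: 1, Message: "a"}} {
		if err := buffer.Write(event); err != nil {
			log.Fatal(err)
		}
	}

	// Buffer implements sort.Interface; rows are reordered according to the
	// buffer's sorting columns.
	sort.Sort(buffer)

	output := new(bytes.Buffer)
	writer := parquet.NewGenericWriter[Event](output)
	if _, err := writer.WriteRowGroup(buffer); err != nil {
		log.Fatal(err)
	}
	if err := writer.Close(); err != nil {
		log.Fatal(err)
	}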
BufferPool is an interface abstracting the underlying implementation of page buffer pools. The parquet-go package provides two implementations of this interface, one backed by in-memory buffers (on the Go heap), and the other using temporary files on disk. Applications which need finer grain control over the allocation and retention of page buffers may choose to provide their own implementation and install it via the parquet.ColumnPageBuffers writer option. BufferPool implementations must be safe to use concurrently from multiple goroutines. GetBuffer is called when a parquet writer needs to acquire a new page buffer from the pool. PutBuffer is called when a parquet writer releases a page buffer to the pool. The parquet.Writer type guarantees that the buffers it calls this method with were previously acquired by a call to GetBuffer on the same pool, and that it will not use them anymore after the call. func NewBufferPool() BufferPool func NewChunkBufferPool(chunkSize int) BufferPool func NewFileBufferPool(tempdir, pattern string) BufferPool func ColumnPageBuffers(buffers BufferPool) WriterOption func SortingBuffers(buffers BufferPool) SortingOption
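A short sketch of installing a file-backed buffer pool so that page buffers spill to temporary files instead of the Go heap; the directory and file pattern are arbitrary:

	pool := parquet.NewFileBufferPool("/tmp", "page-buffers.*")
	// ColumnPageBuffers makes the writer acquire its page buffers from the pool.
	writer := parquet.NewWriter(io.Discard, parquet.ColumnPageBuffers(pool))
	_ = writer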
ByteArrayReader is an interface implemented by ValueReader instances which expose the content of a column of variable length byte array values. Read values into the byte buffer passed as argument, returning the number of values written to the buffer (not the number of bytes). Values are written using the PLAIN encoding, each byte array prefixed with its length encoded as a 4 bytes little endian unsigned integer. The method returns io.EOF when all values have been read. If the buffer was not empty, but too small to hold at least one value, io.ErrShortBuffer is returned.
ByteArrayWriter is an interface implemented by ValueWriter instances which support writing columns of variable length byte array values. Write variable length byte array values. The values passed as input must be laid out using the PLAIN encoding, with each byte array prefixed with the four bytes little endian unsigned integer length. The method returns the number of values written to the underlying column (not the number of bytes), or any error that occurred while attempting to write the values.
Column represents a column in a parquet file. Methods of Column values are safe to call concurrently from multiple goroutines. Column instances satisfy the Node interface. Column returns the child column matching the given name. Columns returns the list of child columns. The method returns the same slice across multiple calls, the program must treat it as a read-only value. Compression returns the compression codecs used by this column. DecodeDataPageV1 decodes a data page from the header, compressed data, and optional dictionary passed as arguments. DecodeDataPageV2 decodes a data page from the header, compressed data, and optional dictionary passed as arguments. DecodeDictionary decodes a data page from the header and compressed data passed as arguments. Depth returns the position of the column relative to the root. Encoding returns the encodings used by this column. Fields returns the list of fields on the column. GoType returns the Go type that best represents the parquet column. ID returns column field id Index returns the position of the column in a row. Only leaf columns have a column index, the method returns -1 when called on non-leaf columns. Leaf returns true if c is a leaf column. MaxDefinitionLevel returns the maximum value of definition levels on this column. MaxRepetitionLevel returns the maximum value of repetition levels on this column. Name returns the column name. Optional returns true if the column is optional. Pages returns a reader exposing all pages in this column, across row groups. Path of the column in the parquet schema. Repeated returns true if the column may repeat. Required returns true if the column is required. String returns a human-readable string representation of the column. Type returns the type of the column. The returned value is unspecified if c is not a leaf column. Value returns the sub-value in base for the child column at the given index. *Column : Field *Column : Node *Column : github.com/polarsignals/frostdb/query/logicalplan.Named *Column : expvar.Var *Column : fmt.Stringer func (*Column).Column(name string) *Column func (*Column).Columns() []*Column func (*File).Root() *Column
ColumnBuffer is an interface representing columns of a row group. ColumnBuffer implements sort.Interface as a way to support reordering the rows that have been written to it. The current implementation has a limitation which prevents applications from providing custom versions of this interface because it contains unexported methods. The only way to create ColumnBuffer values is to call the NewColumnBuffer of Type instances. This limitation may be lifted in future releases. ( ColumnBuffer) BloomFilter() BloomFilter Returns the current capacity of the column (rows). Returns a copy of the column. The returned copy shares no memory with the original, mutations of either column will not modify the other. Returns the index of this column in its parent row group. Returns the components of the page index for this column chunk, containing details about the content and location of pages within the chunk. Note that the returned value may be the same across calls to these methods, programs must treat those as read-only. If the column chunk does not have a column or offset index, the methods return ErrMissingColumnIndex or ErrMissingOffsetIndex respectively. Prior to v0.20, these methods did not return an error because the page index for a file was either fully read when the file was opened, or skipped completely using the parquet.SkipPageIndex option. Version v0.20 introduced a change that the page index can be read on-demand at any time, even if a file was opened with the parquet.SkipPageIndex option. Since reading the page index can fail, these methods now return an error. For indexed columns, returns the underlying dictionary holding the column values. If the column is not indexed, nil is returned. Returns the number of rows currently written to the column. Compares rows at index i and j and reports whether i < j. Returns the number of values in the column chunk. This quantity may differ from the number of rows in the parent row group because repeated columns may hold zero or more values per row. ( ColumnBuffer) OffsetIndex() (OffsetIndex, error) Returns the column as a Page. Returns a reader exposing the pages of the column. ( ColumnBuffer) ReadValuesAt([]Value, int64) (int, error) Clears all rows written to the column. Returns the size of the column buffer in bytes. Swaps rows at index i and j. Returns the column type. Write values from the buffer passed as argument and returns the number of values written. ColumnBuffer : ColumnChunk ColumnBuffer : ValueReaderAt ColumnBuffer : ValueWriter ColumnBuffer : sort.Interface func (*Buffer).ColumnBuffers() []ColumnBuffer func ColumnBuffer.Clone() ColumnBuffer func (*GenericBuffer)[T].ColumnBuffers() []ColumnBuffer func Type.NewColumnBuffer(columnIndex, numValues int) ColumnBuffer
The ColumnChunk interface represents individual columns of a row group. ( ColumnChunk) BloomFilter() BloomFilter Returns the index of this column in its parent row group. Returns the components of the page index for this column chunk, containing details about the content and location of pages within the chunk. Note that the returned value may be the same across calls to these methods, programs must treat those as read-only. If the column chunk does not have a column or offset index, the methods return ErrMissingColumnIndex or ErrMissingOffsetIndex respectively. Prior to v0.20, these methods did not return an error because the page index for a file was either fully read when the file was opened, or skipped completely using the parquet.SkipPageIndex option. Version v0.20 introduced a change that the page index can be read on-demand at any time, even if a file was opened with the parquet.SkipPageIndex option. Since reading the page index can fail, these methods now return an error. Returns the number of values in the column chunk. This quantity may differ from the number of rows in the parent row group because repeated columns may hold zero or more values per row. ( ColumnChunk) OffsetIndex() (OffsetIndex, error) Returns a reader exposing the pages of the column. Returns the column type. ColumnBuffer (interface) *github.com/polarsignals/frostdb/dynparquet.NilColumnChunk func (*Buffer).ColumnChunks() []ColumnChunk func (*GenericBuffer)[T].ColumnChunks() []ColumnChunk func (*RowBuffer)[T].ColumnChunks() []ColumnChunk func RowGroup.ColumnChunks() []ColumnChunk func github.com/polarsignals/frostdb/dynparquet.(*Buffer).ColumnChunks() []ColumnChunk func github.com/polarsignals/frostdb/dynparquet.DynamicRowGroup.ColumnChunks() []ColumnChunk func github.com/polarsignals/frostdb/dynparquet.(*DynamicRowGroupMergeAdapter).ColumnChunks() []ColumnChunk func github.com/polarsignals/frostdb/index.ReleaseableRowGroup.ColumnChunks() []ColumnChunk func github.com/polarsignals/frostdb/query/expr.(*ColumnRef).Column(p expr.Particulate) (ColumnChunk, bool, error) func github.com/polarsignals/frostdb/query/expr.Particulate.ColumnChunks() []ColumnChunk func PrintColumnChunk(w io.Writer, columnChunk ColumnChunk) error func github.com/polarsignals/frostdb/query/expr.BinaryScalarOperation(left ColumnChunk, right Value, operator logicalplan.Op) (bool, error)
ColumnIndex: IsAscending returns true if the column index min/max values are sorted in ascending order (based on the ordering rules of the column's logical type). IsDescending returns true if the column index min/max values are sorted in descending order (based on the ordering rules of the column's logical type). ( ColumnIndex) MaxValue(int) Value PageIndex return min/max bounds for the page at the given index in the column. Returns the number of null values in the page at the given index. Tells whether the page at the given index contains null values only. NumPages returns the number of pages in the column index. func NewColumnIndex(kind Kind, index *format.ColumnIndex) ColumnIndex func ColumnBuffer.ColumnIndex() (ColumnIndex, error) func ColumnChunk.ColumnIndex() (ColumnIndex, error) func github.com/polarsignals/frostdb/dynparquet.(*NilColumnChunk).ColumnIndex() (ColumnIndex, error) func Find(index ColumnIndex, value Value, cmp func(Value, Value) int) int func Search(index ColumnIndex, value Value, typ Type) int func github.com/polarsignals/frostdb/query/expr.Max(columnIndex ColumnIndex) Value func github.com/polarsignals/frostdb/query/expr.Min(columnIndex ColumnIndex) Value func github.com/polarsignals/frostdb/query/expr.NullCount(columnIndex ColumnIndex) int64
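As a hedged sketch, a column index can be combined with Search to locate the first page of a column chunk that may contain a value; rowGroup is an assumed parquet.RowGroup and ValueOf is assumed to construct a parquet.Value from a Go value:

	chunk := rowGroup.ColumnChunks()[0]

	columnIndex, err := chunk.ColumnIndex()
	if err != nil {
		log.Fatal(err)
	}

	// Search returns the index of the first page that may contain the value;
	// a result equal to NumPages means no page matched.
	target := parquet.ValueOf("alice")
	if page := parquet.Search(columnIndex, target, chunk.Type()); page < columnIndex.NumPages() {
		fmt.Println("value may be present in page", page)
	}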
The ColumnIndexer interface is implemented by types that support generating parquet column indexes. The package does not export any types that implement this interface, programs must call NewColumnIndexer on a Type instance to construct column indexers. Generates a format.ColumnIndex value from the current state of the column indexer. The returned value may reference internal buffers, in which case the values remain valid until the next call to IndexPage or Reset on the column indexer. Add a page to the column indexer. Resets the column indexer state. func Type.NewColumnIndexer(sizeLimit int) ColumnIndexer
Conversion is an interface implemented by types that provide conversion of parquet rows from one schema to another. Conversion instances must be safe to use concurrently from multiple goroutines. Converts the given column index in the target schema to the original column index in the source schema of the conversion. Applies the conversion logic on the src row, returning the result appended to dst. Returns the target schema of the conversion. func Convert(to, from Node) (conv Conversion, err error) func ConvertRowGroup(rowGroup RowGroup, conv Conversion) RowGroup func ConvertRowReader(rows RowReader, conv Conversion) RowReaderWithSchema
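A small sketch of converting rows between two schemas derived from Go struct types; both struct types are hypothetical and rowGroup is an assumed parquet.RowGroup:

	type V1 struct {
		ID   int64  `parquet:"id"`
		Name string `parquet:"name"`
	}
	type V2 struct {
		ID   int64  `parquet:"id"`
		Name string `parquet:"name"`
		Note string `parquet:"note,optional"`
	}

	// Build a conversion from the V1 schema to the V2 schema.
	conv, err := parquet.Convert(parquet.SchemaOf(V2{}), parquet.SchemaOf(V1{}))
	if err != nil {
		log.Fatal(err)
	}

	// Expose an existing row group through the V2 schema.
	converted := parquet.ConvertRowGroup(rowGroup, conv)
	_ = converted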
ConvertError is an error type returned by calls to Convert when the conversion of parquet schemas is impossible or the input row for the conversion is malformed. From Node Path []string To Node Error satisfies the error interface. *ConvertError : error
DataPageHeader is a specialization of the PageHeader interface implemented by data pages. Returns the encoding of the definition level section. Returns the page encoding. Returns the maximum value in the page based on the ordering rules of the column's logical type. As an optimization, the method may return the same slice across multiple calls. Programs must treat the returned value as immutable to prevent unpredictable behaviors. If the page contains only null values, an empty slice is returned. Returns the minimum value in the page based on the ordering rules of the column's logical type. As an optimization, the method may return the same slice across multiple calls. Programs must treat the returned value as immutable to prevent unpredictable behaviors. If the page contains only null values, an empty slice is returned. Returns the number of null values in the page. Returns the number of values in the page (including nulls). Returns the parquet format page type. Returns the encoding of the repetition level section. DataPageHeaderV1 DataPageHeaderV2 DataPageHeader : PageHeader
DataPageHeaderV1 is an implementation of the DataPageHeader interface representing data pages version 1. ( DataPageHeaderV1) DefinitionLevelEncoding() format.Encoding ( DataPageHeaderV1) Encoding() format.Encoding ( DataPageHeaderV1) MaxValue() []byte ( DataPageHeaderV1) MinValue() []byte ( DataPageHeaderV1) NullCount() int64 ( DataPageHeaderV1) NumValues() int64 ( DataPageHeaderV1) PageType() format.PageType ( DataPageHeaderV1) RepetitionLevelEncoding() format.Encoding ( DataPageHeaderV1) String() string DataPageHeaderV1 : DataPageHeader DataPageHeaderV1 : PageHeader DataPageHeaderV1 : expvar.Var DataPageHeaderV1 : fmt.Stringer func (*Column).DecodeDataPageV1(header DataPageHeaderV1, page []byte, dict Dictionary) (Page, error)
DataPageHeaderV2 is an implementation of the DataPageHeader interface representing data pages version 2. ( DataPageHeaderV2) DefinitionLevelEncoding() format.Encoding ( DataPageHeaderV2) DefinitionLevelsByteLength() int64 ( DataPageHeaderV2) Encoding() format.Encoding ( DataPageHeaderV2) IsCompressed() bool ( DataPageHeaderV2) MaxValue() []byte ( DataPageHeaderV2) MinValue() []byte ( DataPageHeaderV2) NullCount() int64 ( DataPageHeaderV2) NumNulls() int64 ( DataPageHeaderV2) NumRows() int64 ( DataPageHeaderV2) NumValues() int64 ( DataPageHeaderV2) PageType() format.PageType ( DataPageHeaderV2) RepetitionLevelEncoding() format.Encoding ( DataPageHeaderV2) RepetitionLevelsByteLength() int64 ( DataPageHeaderV2) String() string DataPageHeaderV2 : DataPageHeader DataPageHeaderV2 : PageHeader DataPageHeaderV2 : expvar.Var DataPageHeaderV2 : fmt.Stringer func (*Column).DecodeDataPageV2(header DataPageHeaderV2, page []byte, dict Dictionary) (Page, error)
The Dictionary interface represents type-specific implementations of parquet dictionaries. Programs can instantiate dictionaries by calling the NewDictionary method of a Type object. The current implementation has a limitation which prevents applications from providing custom versions of this interface because it contains unexported methods. The only way to create Dictionary values is to call the NewDictionary method of Type instances. This limitation may be lifted in future releases. Returns the min and max values found in the given indexes. Returns the dictionary value at the given index. Inserts values from the second slice to the dictionary and writes the indexes at which each value was inserted to the first slice. The method panics if the length of the indexes slice is smaller than the length of the values slice. Returns the number of values indexed in the dictionary. Given an array of dictionary indexes, lookup the values into the array of values passed as second argument. The method panics if len(indexes) > len(values), or one of the indexes is negative or greater than the highest index in the dictionary. Returns a Page representing the content of the dictionary. The returned page shares the underlying memory of the buffer, it remains valid to use until the dictionary's Reset method is called. Resets the dictionary to its initial state, removing all values. Returns the type that the dictionary was created from. func (*Column).DecodeDictionary(header DictionaryPageHeader, page []byte) (Dictionary, error) func ColumnBuffer.Dictionary() Dictionary func Page.Dictionary() Dictionary func Type.NewDictionary(columnIndex, numValues int, data encoding.Values) Dictionary func (*Column).DecodeDataPageV1(header DataPageHeaderV1, page []byte, dict Dictionary) (Page, error) func (*Column).DecodeDataPageV2(header DataPageHeaderV2, page []byte, dict Dictionary) (Page, error)
DictionaryPageHeader is an implementation of the PageHeader interface representing dictionary pages. ( DictionaryPageHeader) Encoding() format.Encoding ( DictionaryPageHeader) IsSorted() bool ( DictionaryPageHeader) NumValues() int64 ( DictionaryPageHeader) PageType() format.PageType ( DictionaryPageHeader) String() string DictionaryPageHeader : PageHeader DictionaryPageHeader : expvar.Var DictionaryPageHeader : fmt.Stringer func (*Column).DecodeDictionary(header DictionaryPageHeader, page []byte) (Dictionary, error)
DoubleReader is an interface implemented by ValueReader instances which expose the content of a column of double-precision float point values. Read double-precision floating point values into the buffer passed as argument. The method returns io.EOF when all values have been read.
DoubleWriter is an interface implemented by ValueWriter instances which support writing columns of double-precision floating point values. Write double-precision floating point values. The method returns the number of values written, and any error that occurred while writing the values.
Field instances represent fields of a parquet node, which associate a node to their name in their parent node. Returns compression codec used by the node. The method may return nil to indicate that no specific compression codec was configured on the node, in which case a default compression might be used. Returns the encoding used by the node. The method may return nil to indicate that no specific encoding was configured on the node, in which case a default encoding might be used. Returns a mapping of the node's fields. As an optimization, the same slices may be returned by multiple calls to this method, programs must treat the returned values as immutable. This method returns an empty mapping when called on leaf nodes. Returns the Go type that best represents the parquet node. For leaf nodes, this will be one of bool, int32, int64, deprecated.Int96, float32, float64, string, []byte, or [N]byte. For groups, the method returns a struct type. If the method is called on a repeated node, the method returns a slice of the underlying type. For optional nodes, the method returns a pointer of the underlying type. For nodes that were constructed from Go values (e.g. using SchemaOf), the method returns the original Go type. The id of this node in its parent node. Zero value is treated as id is not set. ID only needs to be unique within its parent context. This is the same as parquet field_id. Returns true if this is a leaf node. Returns the name of this field in its parent node. Returns whether the parquet column is optional. Returns whether the parquet column is repeated. Returns whether the parquet column is required. Returns a human-readable representation of the parquet node. For leaf nodes, returns the type of values of the parquet column. Calling this method on non-leaf nodes will panic. Given a reference to the Go value matching the structure of the parent node, returns the Go value of the field. *Column Field : Node Field : github.com/polarsignals/frostdb/query/logicalplan.Named Field : expvar.Var Field : fmt.Stringer func (*Column).Fields() []Field func Field.Fields() []Field func Group.Fields() []Field func Node.Fields() []Field func (*Schema).Fields() []Field func github.com/polarsignals/frostdb/dynparquet.FieldByName(fields []Field, name string) Field func github.com/polarsignals/frostdb/dynparquet.Concat(fields []Field, drg ...dynparquet.DynamicRowGroup) dynparquet.DynamicRowGroup func github.com/polarsignals/frostdb/dynparquet.FieldByName(fields []Field, name string) Field func github.com/polarsignals/frostdb/dynparquet.FindChildIndex(fields []Field, name string) int func github.com/polarsignals/frostdb/dynparquet.NewDynamicRow(row Row, schema *Schema, dyncols map[string][]string, fields []Field) *dynparquet.DynamicRow func github.com/polarsignals/frostdb/dynparquet.NewDynamicRows(rows []Row, schema *Schema, dynamicColumns map[string][]string, fields []Field) *dynparquet.DynamicRows func github.com/polarsignals/frostdb/pqarrow.SingleMatchingColumn(distinctColumns []logicalplan.Expr, fields []Field) bool func github.com/polarsignals/frostdb/pqarrow/convert.ParquetFieldToArrowField(pf Field) (arrow.Field, error)
File represents a parquet file. The layout of a Parquet file can be found here: https://github.com/apache/parquet-format#file-format ColumnIndexes returns the page index of the parquet file f. If the file did not contain a column index, the method returns an empty slice and nil error. Lookup returns the value associated with the given key in the file key/value metadata. The ok boolean will be true if the key was found, false otherwise. Metadata returns the metadata of f. NumRows returns the number of rows in the file. OffsetIndexes returns the page index of the parquet file f. If the file did not contain an offset index, the method returns an empty slice and nil error. ReadAt reads bytes into b from f at the given offset. The method satisfies the io.ReaderAt interface. ReadPageIndex reads the page index section of the parquet file f. If the file did not contain a page index, the method returns two empty slices and a nil error. Only leaf columns have indexes, the returned indexes are arranged using the following layout:
	------------------
	| col 0: chunk 0 |
	------------------
	| col 1: chunk 0 |
	------------------
	|      ...       |
	------------------
	| col 0: chunk 1 |
	------------------
	| col 1: chunk 1 |
	------------------
	|      ...       |
	------------------
This method is useful in combination with the SkipPageIndex option to delay reading the page index section until after the file was opened. Note that in this case the page index is not cached within the file, programs are expected to make use of it independently from the parquet package. Root returns the root column of f. RowGroups returns the list of row groups in the file. Schema returns the schema of f. Size returns the size of f (in bytes). *File : github.com/sethvargo/go-envconfig.Lookuper *File : golang.org/x/text/internal/catmsg.Dictionary *File : golang.org/x/text/message/catalog.Dictionary *File : io.ReaderAt func OpenFile(r io.ReaderAt, size int64, options ...FileOption) (*File, error) func github.com/polarsignals/frostdb/dynparquet.(*SerializedBuffer).ParquetFile() *File func github.com/polarsignals/frostdb/dynparquet.DefinitionFromParquetFile(file *File) (*schemapb.Schema, error) func github.com/polarsignals/frostdb/dynparquet.NewSerializedBuffer(f *File) (*dynparquet.SerializedBuffer, error) func github.com/polarsignals/frostdb/dynparquet.SchemaFromParquetFile(file *File) (*dynparquet.Schema, error)
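A minimal sketch of opening a parquet file from an *os.File (which implements io.ReaderAt) and walking its row groups; the path is hypothetical:

	osFile, err := os.Open("/tmp/example.parquet")
	if err != nil {
		log.Fatal(err)
	}
	defer osFile.Close()

	stat, err := osFile.Stat()
	if err != nil {
		log.Fatal(err)
	}

	f, err := parquet.OpenFile(osFile, stat.Size())
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println("rows:", f.NumRows())
	for _, rowGroup := range f.RowGroups() {
		fmt.Println("row group with", rowGroup.NumRows(), "rows")
	}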
The FileConfig type carries configuration options for parquet files. FileConfig implements the FileOption interface so it can be used directly as argument to the OpenFile function when needed, for example:
	f, err := parquet.OpenFile(reader, size, &parquet.FileConfig{
		SkipPageIndex:    true,
		SkipBloomFilters: true,
		ReadMode:         ReadModeAsync,
	})
ReadBufferSize int ReadMode ReadMode Schema *Schema SkipBloomFilters bool SkipPageIndex bool Apply applies the given list of options to c. ConfigureFile applies configuration options from c to config. Validate returns a non-nil error if the configuration of c is invalid. *FileConfig : FileOption func DefaultFileConfig() *FileConfig func NewFileConfig(options ...FileOption) (*FileConfig, error) func (*FileConfig).ConfigureFile(config *FileConfig) func FileOption.ConfigureFile(*FileConfig)
FileOption is an interface implemented by types that carry configuration options for parquet files. ( FileOption) ConfigureFile(*FileConfig) *FileConfig func FileReadMode(mode ReadMode) FileOption func FileSchema(schema *Schema) FileOption func ReadBufferSize(size int) FileOption func SkipBloomFilters(skip bool) FileOption func SkipPageIndex(skip bool) FileOption func NewFileConfig(options ...FileOption) (*FileConfig, error) func OpenFile(r io.ReaderAt, size int64, options ...FileOption) (*File, error) func (*FileConfig).Apply(options ...FileOption)
FixedLenByteArrayReader is an interface implemented by ValueReader instances which expose the content of a column of fixed length byte array values. Read values into the byte buffer passed as argument, returning the number of values written to the buffer (not the number of bytes). The method returns io.EOF when all values have been read. If the buffer was not empty, but too small to hold at least one value, io.ErrShortBuffer is returned.
FixedLenByteArrayWriter is an interface implemented by ValueWriter instances which support writing columns of fixed length byte array values. Writes the fixed length byte array values. The size of the values is assumed to be the same as the expected size of items in the column. The method errors if the length of the input values is not a multiple of the expected item size.
FloatReader is an interface implemented by ValueReader instances which expose the content of a column of single-precision floating point values. Read single-precision floating point values into the buffer passed as argument. The method returns io.EOF when all values have been read.
FloatWriter is an interface implemented by ValueWriter instances which support writing columns of single-precision floating point values. Write single-precision floating point values. The method returns the number of values written, and any error that occurred while writing the values.
Type Parameters: T: any GenericBuffer is similar to a Buffer but uses a type parameter to define the Go type representing the schema of rows in the buffer. See GenericWriter for details about the benefits over the classic Buffer API. (*GenericBuffer[T]) ColumnBuffers() []ColumnBuffer (*GenericBuffer[T]) ColumnChunks() []ColumnChunk (*GenericBuffer[T]) Len() int (*GenericBuffer[T]) Less(i, j int) bool (*GenericBuffer[T]) NumRows() int64 (*GenericBuffer[T]) Reset() (*GenericBuffer[T]) Rows() Rows (*GenericBuffer[T]) Schema() *Schema (*GenericBuffer[T]) Size() int64 (*GenericBuffer[T]) SortingColumns() []SortingColumn (*GenericBuffer[T]) Swap(i, j int) (*GenericBuffer[T]) Write(rows []T) (int, error) (*GenericBuffer[T]) WriteRowGroup(rowGroup RowGroup) (int64, error) (*GenericBuffer[T]) WriteRows(rows []Row) (int, error) *GenericBuffer : RowGroup *GenericBuffer : RowGroupWriter *GenericBuffer : RowWriter *GenericBuffer : RowWriterWithSchema *GenericBuffer : github.com/polarsignals/frostdb/query/expr.Particulate *GenericBuffer : sort.Interface func NewGenericBuffer[T](options ...RowGroupOption) *GenericBuffer[T]
Type Parameters: T: any GenericReader is similar to a Reader but uses a type parameter to define the Go type representing the schema of rows being read. See GenericWriter for details about the benefits over the classic Reader API. (*GenericReader[T]) Close() error (*GenericReader[T]) NumRows() int64 Read reads the next rows from the reader into the given rows slice up to len(rows). The returned values are safe to reuse across Read calls and do not share memory with the reader's underlying page buffers. The method returns the number of rows read and io.EOF when no more rows can be read from the reader. (*GenericReader[T]) ReadRows(rows []Row) (int, error) (*GenericReader[T]) Reset() (*GenericReader[T]) Schema() *Schema (*GenericReader[T]) SeekToRow(rowIndex int64) error *GenericReader : RowReader *GenericReader : RowReaderWithSchema *GenericReader : RowReadSeeker *GenericReader : RowSeeker *GenericReader : Rows *GenericReader : github.com/prometheus/common/expfmt.Closer *GenericReader : io.Closer func NewGenericReader[T](input io.ReaderAt, options ...ReaderOption) *GenericReader[T] func NewGenericRowGroupReader[T](rowGroup RowGroup, options ...ReaderOption) *GenericReader[T] func github.com/polarsignals/frostdb/dynparquet.(*SerializedBuffer).Reader() *GenericReader[any]
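A short sketch of batch reads with GenericReader; the row type is hypothetical and input is an assumed io.ReaderAt positioned on a parquet file:

	type Record struct {
		Name string `parquet:"name"`
	}

	reader := parquet.NewGenericReader[Record](input)
	defer reader.Close()

	batch := make([]Record, 64)
	for {
		n, err := reader.Read(batch)
		for _, record := range batch[:n] {
			fmt.Println(record.Name)
		}
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
	}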
Type Parameters: T: any GenericWriter is similar to a Writer but uses a type parameter to define the Go type representing the schema of rows being written. Using this type over Writer has multiple advantages:
- By leveraging type information, the Go compiler can provide greater guarantees that the code is correct. For example, the parquet.Writer.Write method accepts an argument of type interface{}, which delays type checking until runtime. The parquet.GenericWriter[T].Write method ensures at compile time that the values it receives will be of type T, reducing the risk of introducing errors.
- Since type information is known at compile time, the implementation of parquet.GenericWriter[T] can make safe assumptions, removing the need for runtime validation of how the parameters are passed to its methods. Optimizations relying on type information are more effective, some of the writer's state can be precomputed at initialization, which was not possible with parquet.Writer.
- The parquet.GenericWriter[T].Write method uses a data-oriented design, accepting a slice of T instead of a single value, creating more opportunities to amortize the runtime cost of abstractions. This optimization is not available for parquet.Writer because its Write method's argument would be of type []interface{}, which would require conversions back and forth from concrete types to empty interfaces (since a []T cannot be interpreted as []interface{} in Go), would make the API more difficult to use and waste compute resources in the type conversions, defeating the purpose of the optimization in the first place.
Note that this type is only available when compiling with Go 1.18 or later. (*GenericWriter[T]) Close() error (*GenericWriter[T]) Flush() error (*GenericWriter[T]) ReadRowsFrom(rows RowReader) (int64, error) (*GenericWriter[T]) Reset(output io.Writer) (*GenericWriter[T]) Schema() *Schema SetKeyValueMetadata sets a key/value pair in the Parquet file metadata. Keys are assumed to be unique, if the same key is repeated multiple times the last value is retained. While the parquet format does not require unique keys, this design decision was made to optimize for the most common use case where applications leverage this extension mechanism to associate single values to keys. This may create incompatibilities with other parquet libraries, or may cause some key/value pairs to be lost when opening parquet files written with repeated keys. We can revisit this decision if it ever becomes a blocker. (*GenericWriter[T]) Write(rows []T) (int, error) (*GenericWriter[T]) WriteRowGroup(rowGroup RowGroup) (int64, error) (*GenericWriter[T]) WriteRows(rows []Row) (int, error) *GenericWriter : RowGroupWriter *GenericWriter : RowReaderFrom *GenericWriter : RowWriter *GenericWriter : RowWriterWithSchema *GenericWriter : github.com/apache/thrift/lib/go/thrift.Flusher *GenericWriter : github.com/polarsignals/frostdb.ParquetWriter *GenericWriter : github.com/prometheus/common/expfmt.Closer *GenericWriter : io.Closer func NewGenericWriter[T](output io.Writer, options ...WriterOption) *GenericWriter[T]
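A brief sketch of writing a slice of rows with GenericWriter; the row type and values are hypothetical:

	type Record struct {
		Name string `parquet:"name"`
		Age  int32  `parquet:"age"`
	}

	output := new(bytes.Buffer)
	writer := parquet.NewGenericWriter[Record](output)

	// Write accepts a slice of T, letting the writer amortize per-call overhead.
	if _, err := writer.Write([]Record{
		{Name: "Alice", Age: 30},
		{Name: "Bob", Age: 25},
	}); err != nil {
		log.Fatal(err)
	}

	// Close flushes buffered pages and writes the parquet footer.
	if err := writer.Close(); err != nil {
		log.Fatal(err)
	}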
Group: ( Group) Compression() compress.Codec ( Group) Encoding() encoding.Encoding ( Group) Fields() []Field ( Group) GoType() reflect.Type ( Group) ID() int ( Group) Leaf() bool ( Group) Optional() bool ( Group) Repeated() bool ( Group) Required() bool ( Group) String() string ( Group) Type() Type Group : Node Group : expvar.Var Group : fmt.Stringer
Int32Reader is an interface implemented by ValueReader instances which expose the content of a column of int32 values. Read 32 bits integer values into the buffer passed as argument. The method returns io.EOF when all values have been read.
Int32Writer is an interface implemented by ValueWriter instances which support writing columns of 32 bits signed integer values. Write 32 bits signed integer values. The method returns the number of values written, and any error that occurred while writing the values.
Int64Reader is an interface implemented by ValueReader instances which expose the content of a column of int64 values. Read 64 bits integer values into the buffer passed as argument. The method returns io.EOF when all values have been read.
Int64Writer is an interface implemented by ValueWriter instances which support writing columns of 64 bits signed integer values. Write 64 bits signed integer values. The method returns the number of values written, and any error that occurred while writing the values.
Int96Reader is an interface implemented by ValueReader instances which expose the content of a column of int96 values. Read 96 bits integer values into the buffer passed as argument. The method returns io.EOF when all values have been read.
Int96Writer is an interface implemented by ValueWriter instances which support writing columns of 96 bits signed integer values. Write 96 bits signed integer values. The method returns the number of values written, and any error that occurred while writing the values.
Kind is an enumeration type representing the physical types supported by the parquet type system. String returns a human-readable representation of the physical type. Value constructs a value from k and v. The method panics if the data is not a valid representation of the value kind; for example, if the kind is Int32 but the data is not 4 bytes long. Kind : expvar.Var Kind : fmt.Stringer func Type.Kind() Kind func Value.Kind() Kind func NewColumnIndex(kind Kind, index *format.ColumnIndex) ColumnIndex func ZeroValue(kind Kind) Value const Boolean const ByteArray const Double const FixedLenByteArray const Float const Int32 const Int64 const Int96
LeafColumn is a struct type representing leaf columns of a parquet schema. ColumnIndex int MaxDefinitionLevel int MaxRepetitionLevel int Node Node Path []string func (*Schema).Lookup(path ...string) (LeafColumn, bool)
Node values represent nodes of a parquet schema. Nodes carry the type of values, as well as properties like whether the values are optional or repeat. Nodes with one or more children represent parquet groups and therefore do not have a logical type. Nodes are immutable values and therefore safe to use concurrently from multiple goroutines. Returns compression codec used by the node. The method may return nil to indicate that no specific compression codec was configured on the node, in which case a default compression might be used. Returns the encoding used by the node. The method may return nil to indicate that no specific encoding was configured on the node, in which case a default encoding might be used. Returns a mapping of the node's fields. As an optimization, the same slices may be returned by multiple calls to this method, programs must treat the returned values as immutable. This method returns an empty mapping when called on leaf nodes. Returns the Go type that best represents the parquet node. For leaf nodes, this will be one of bool, int32, int64, deprecated.Int96, float32, float64, string, []byte, or [N]byte. For groups, the method returns a struct type. If the method is called on a repeated node, the method returns a slice of the underlying type. For optional nodes, the method returns a pointer of the underlying type. For nodes that were constructed from Go values (e.g. using SchemaOf), the method returns the original Go type. The id of this node in its parent node. Zero value is treated as id is not set. ID only needs to be unique within its parent context. This is the same as parquet field_id. Returns true if this is a leaf node. Returns whether the parquet column is optional. Returns whether the parquet column is repeated. Returns whether the parquet column is required. Returns a human-readable representation of the parquet node. For leaf nodes, returns the type of values of the parquet column. Calling this method on non-leaf nodes will panic. *Column Field (interface) Group *Schema Node : expvar.Var Node : fmt.Stringer func BSON() Node func Compressed(node Node, codec compress.Codec) Node func Date() Node func Decimal(scale, precision int, typ Type) Node func Encoded(node Node, encoding encoding.Encoding) Node func Enum() Node func FieldID(node Node, id int) Node func Int(bitWidth int) Node func JSON() Node func Leaf(typ Type) Node func List(of Node) Node func Map(key, value Node) Node func Optional(node Node) Node func Repeated(node Node) Node func Required(node Node) Node func String() Node func Time(unit TimeUnit) Node func Timestamp(unit TimeUnit) Node func Uint(bitWidth int) Node func UUID() Node func Compressed(node Node, codec compress.Codec) Node func Convert(to, from Node) (conv Conversion, err error) func Encoded(node Node, encoding encoding.Encoding) Node func FieldID(node Node, id int) Node func List(of Node) Node func Map(key, value Node) Node func NewRowBuilder(schema Node) *RowBuilder func NewSchema(name string, root Node) *Schema func Optional(node Node) Node func PrintSchema(w io.Writer, name string, node Node) error func PrintSchemaIndent(w io.Writer, name string, node Node, pattern, newline string) error func Repeated(node Node) Node func Required(node Node) Node func github.com/polarsignals/frostdb/pqarrow/convert.GetWriter(offset int, n Node) (writer.NewWriterFunc, error) func github.com/polarsignals/frostdb/pqarrow/convert.ParquetNodeToType(n Node) (arrow.DataType, error)
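For illustration, a hedged sketch of assembling a schema from Node constructors, assuming Group is the map-based group node exported by the package; the field names are arbitrary:

	// Build a group node with three fields and wrap it in a named schema.
	schema := parquet.NewSchema("contact", parquet.Group{
		"name":  parquet.String(),
		"phone": parquet.Optional(parquet.String()),
		"age":   parquet.Optional(parquet.Int(32)),
	})
	fmt.Println(schema)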
OffsetIndex: CompressedPageSize returns the size of the page at the given index (in bytes). FirstRowIndex returns the first row in the page at the given index. The returned row index is based on the row group that the page belongs to, the first row has index zero. NumPages returns the number of pages in the offset index. Offset returns the offset starting from the beginning of the file for the page at the given index. func ColumnBuffer.OffsetIndex() (OffsetIndex, error) func ColumnChunk.OffsetIndex() (OffsetIndex, error) func github.com/polarsignals/frostdb/dynparquet.(*NilColumnChunk).OffsetIndex() (OffsetIndex, error)
Page values represent sequences of parquet values. From the Parquet documentation: "Column chunks are a chunk of the data for a particular column. They live in a particular row group and are guaranteed to be contiguous in the file. Column chunks are divided up into pages. A page is conceptually an indivisible unit (in terms of compression and encoding). There can be multiple page types which are interleaved in a column chunk." https://github.com/apache/parquet-format#glossary Returns the page's min and max values. The third value is a boolean indicating whether the page bounds were available. Page bounds may not be known if the page contained no values or only nulls, or if they were read from a parquet file which had neither page statistics nor a page index. Returns the column index that this page belongs to. Returns the in-memory buffer holding the page values. The intent is for the returned value to be used as input parameter when calling the Encode method of the associated Type. The slices referenced by the encoding.Values may be the same across multiple calls to this method, applications must treat the content as immutable. ( Page) DefinitionLevels() []byte If the page contains indexed values, calling this method returns the dictionary in which the values are looked up. Otherwise, the method returns nil. ( Page) NumNulls() int64 Returns the number of rows, values, and nulls in the page. The number of rows may be less than the number of values in the page if the page is part of a repeated column. ( Page) NumValues() int64 Expose the lists of repetition and definition levels of the page. The returned slices may be empty when the page has no repetition or definition levels. Returns the size of the page in bytes (uncompressed). Returns a new page which is a slice of the receiver between row indexes i and j. Returns the type of values read from this page. The returned type can be used to encode the page data, in the case of an indexed page (which has a dictionary), the type is configured to encode the indexes stored in the page rather than the plain values. Returns a reader exposing the values contained in the page. Depending on the underlying implementation, the returned reader may support reading an array of typed Go values by implementing interfaces like parquet.Int32Reader. Applications should use type assertions on the returned reader to determine whether those optimizations are available. func (*Column).DecodeDataPageV1(header DataPageHeaderV1, page []byte, dict Dictionary) (Page, error) func (*Column).DecodeDataPageV2(header DataPageHeaderV2, page []byte, dict Dictionary) (Page, error) func ColumnBuffer.Page() Page func Dictionary.Page() Page func Page.Slice(i, j int64) Page func PageReader.ReadPage() (Page, error) func Pages.ReadPage() (Page, error) func Type.NewPage(columnIndex, numValues int, data encoding.Values) Page func PrintPage(w io.Writer, page Page) error func Release(page Page) func Retain(page Page) func PageWriter.WritePage(Page) (int64, error) func github.com/polarsignals/frostdb/pqarrow/writer.PageWriter.WritePage(Page) error
PageHeader is an interface implemented by parquet page headers. Returns the page encoding. Returns the number of values in the page (including nulls). Returns the parquet format page type. DataPageHeader (interface) DataPageHeaderV1 DataPageHeaderV2 DictionaryPageHeader
PageReader is an interface implemented by types that support producing a sequence of pages. Reads and returns the next page from the sequence. When all pages have been read, or if the sequence was closed, the method returns io.EOF. Pages (interface) func CopyPages(dst PageWriter, src PageReader) (numValues int64, err error)
Pages is an interface implemented by page readers returned by calling the Pages method of ColumnChunk instances. ( Pages) Close() error Reads and returns the next page from the sequence. When all pages have been read, or if the sequence was closed, the method returns io.EOF. Positions the stream on the given row index. Some implementations of the interface may only allow seeking forward. The method returns io.ErrClosedPipe if the stream had already been closed. Pages : PageReader Pages : RowSeeker Pages : github.com/prometheus/common/expfmt.Closer Pages : io.Closer func AsyncPages(pages Pages) Pages func (*Column).Pages() Pages func ColumnBuffer.Pages() Pages func ColumnChunk.Pages() Pages func github.com/polarsignals/frostdb/dynparquet.(*NilColumnChunk).Pages() Pages func AsyncPages(pages Pages) Pages
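A minimal sketch of scanning the pages of a column chunk; chunk is an assumed parquet.ColumnChunk taken from a row group:

	pages := chunk.Pages()
	defer pages.Close()

	for {
		page, err := pages.ReadPage()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println("page holds", page.NumValues(), "values")
		// Release hints that the page's buffers may be reused once the page is
		// no longer needed.
		parquet.Release(page)
	}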
PageWriter is an interface implemented by types that support writing pages to an underlying storage medium. ( PageWriter) WritePage(Page) (int64, error) func CopyPages(dst PageWriter, src PageReader) (numValues int64, err error)
Deprecated: A Reader reads Go values from parquet files. This example showcases a typical use of parquet readers:
	reader := parquet.NewReader(file)
	rows := []RowType{}
	for {
		row := RowType{}
		err := reader.Read(&row)
		if err != nil {
			if err == io.EOF {
				break
			}
			...
		}
		rows = append(rows, row)
	}
	if err := reader.Close(); err != nil {
		...
	}
For programs building with Go 1.18 or later, the GenericReader[T] type supersedes this one. Close closes the reader, preventing more rows from being read. NumRows returns the number of rows that can be read from r. Read reads the next row from r. The type of the row must match the schema of the underlying parquet file or an error will be returned. The method returns io.EOF when no more rows can be read from r. ReadRows reads the next rows from r into the given Row buffer. The returned values are laid out in the order expected by the parquet.(*Schema).Reconstruct method. The method returns io.EOF when no more rows can be read from r. Reset repositions the reader at the beginning of the underlying parquet file. Schema returns the schema of rows read by r. SeekToRow positions r at the given row index. *Reader : RowReader *Reader : RowReaderWithSchema *Reader : RowReadSeeker *Reader : RowSeeker *Reader : Rows *Reader : github.com/prometheus/common/expfmt.Closer *Reader : io.Closer func NewReader(input io.ReaderAt, options ...ReaderOption) *Reader func NewRowGroupReader(rowGroup RowGroup, options ...ReaderOption) *Reader
The ReaderConfig type carries configuration options for parquet readers. ReaderConfig implements the ReaderOption interface so it can be used directly as argument to the NewReader function when needed, for example:
	reader := parquet.NewReader(output, schema, &parquet.ReaderConfig{
		// ...
	})
Schema *Schema Apply applies the given list of options to c. ConfigureReader applies configuration options from c to config. Validate returns a non-nil error if the configuration of c is invalid. *ReaderConfig : ReaderOption func DefaultReaderConfig() *ReaderConfig func NewReaderConfig(options ...ReaderOption) (*ReaderConfig, error) func (*ReaderConfig).ConfigureReader(config *ReaderConfig) func ReaderOption.ConfigureReader(*ReaderConfig) func (*Schema).ConfigureReader(config *ReaderConfig)
ReaderOption is an interface implemented by types that carry configuration options for parquet readers. ( ReaderOption) ConfigureReader(*ReaderConfig) *ReaderConfig *Schema func NewGenericReader[T](input io.ReaderAt, options ...ReaderOption) *GenericReader[T] func NewGenericRowGroupReader[T](rowGroup RowGroup, options ...ReaderOption) *GenericReader[T] func NewReader(input io.ReaderAt, options ...ReaderOption) *Reader func NewReaderConfig(options ...ReaderOption) (*ReaderConfig, error) func NewRowGroupReader(rowGroup RowGroup, options ...ReaderOption) *Reader func Read[T](r io.ReaderAt, size int64, options ...ReaderOption) (rows []T, err error) func ReadFile[T](path string, options ...ReaderOption) (rows []T, err error) func (*ReaderConfig).Apply(options ...ReaderOption)
ReadMode is an enum that is used to configure the way that a File reads pages. func FileReadMode(mode ReadMode) FileOption const DefaultReadMode const ReadModeAsync const ReadModeSync
Row represents a parquet row as a slice of values. Each value should embed a column index, repetition level, and definition level allowing the program to determine how to reconstruct the original object from the row. Clone creates a copy of the row which shares no pointers. This method is useful to capture rows after a call to RowReader.ReadRows when values need to be retained before the next call to ReadRows or after the lifespan of the reader. Equal returns true if row and other contain the same sequence of values. Range calls f for each column of row. func AppendRow(row Row, columns ...[]Value) Row func MakeRow(columns ...[]Value) Row func Row.Clone() Row func (*RowBuilder).AppendRow(row Row) Row func (*RowBuilder).Row() Row func (*Schema).Deconstruct(row Row, value interface{}) Row func github.com/polarsignals/frostdb/pqarrow.RecordToRow(final *Schema, record arrow.Record, index int) (Row, error) func github.com/polarsignals/frostdb/samples.Sample.ToParquetRow(labelNames []string) Row func AppendRow(row Row, columns ...[]Value) Row func (*Buffer).WriteRows(rows []Row) (int, error) func Conversion.Convert(rows []Row) (int, error) func (*GenericBuffer)[T].WriteRows(rows []Row) (int, error) func (*GenericReader)[T].ReadRows(rows []Row) (int, error) func (*GenericWriter)[T].WriteRows(rows []Row) (int, error) func (*Reader).ReadRows(rows []Row) (int, error) func Row.Equal(other Row) bool func (*RowBuffer)[T].WriteRows(rows []Row) (int, error) func (*RowBuilder).AppendRow(row Row) Row func RowReader.ReadRows([]Row) (int, error) func RowReaderFunc.ReadRows(rows []Row) (int, error) func RowReaderWithSchema.ReadRows([]Row) (int, error) func RowReadSeeker.ReadRows([]Row) (int, error) func Rows.ReadRows([]Row) (int, error) func RowWriter.WriteRows([]Row) (int, error) func RowWriterFunc.WriteRows(rows []Row) (int, error) func RowWriterWithSchema.WriteRows([]Row) (int, error) func (*Schema).Deconstruct(row Row, value interface{}) Row func (*Schema).Reconstruct(value interface{}, row Row) error func (*SortingWriter)[T].WriteRows(rows []Row) (int, error) func (*Writer).WriteRows(rows []Row) (int, error) func github.com/polarsignals/frostdb.ParquetWriter.WriteRows([]Row) (int, error) func github.com/polarsignals/frostdb/dynparquet.NewDynamicRow(row Row, schema *Schema, dyncols map[string][]string, fields []Field) *dynparquet.DynamicRow func github.com/polarsignals/frostdb/dynparquet.NewDynamicRows(rows []Row, schema *Schema, dynamicColumns map[string][]string, fields []Field) *dynparquet.DynamicRows func github.com/polarsignals/frostdb/dynparquet.ValuesForIndex(row Row, index int) []Value func github.com/polarsignals/frostdb/dynparquet.(*Buffer).WriteRows(rows []Row) (int, error) func github.com/polarsignals/frostdb/dynparquet.ParquetWriter.WriteRows(rows []Row) (int, error)
Type Parameters: T: any RowBuffer is an implementation of the RowGroup interface which stores parquet rows in memory. Unlike GenericBuffer which uses a column layout to store values in memory buffers, RowBuffer uses a row layout. The use of row layout provides greater efficiency when sorting the buffer, which is the primary use case for the RowBuffer type. Applications which intend to sort rows prior to writing them to a parquet file will often see lower CPU utilization from using a RowBuffer than a GenericBuffer. RowBuffer values are not safe to use concurrently from multiple goroutines. ColumnChunks returns a view of the buffer's columns. Note that reading columns of a RowBuffer will be less efficient than reading columns of a GenericBuffer since the latter uses a column layout. This method is mainly exposed to satisfy the RowGroup interface, applications which need compute-efficient column scans on in-memory buffers should likely use a GenericBuffer instead. The returned column chunks are snapshots at the time the method is called, they remain valid until the next call to Reset on the buffer. Len returns the number of rows in the buffer. The method contributes to satisfying sort.Interface. Less compares the rows at index i and j according to the sorting columns configured on the buffer. The method contributes to satisfying sort.Interface. NumRows returns the number of rows currently written to the buffer. Reset clears the content of the buffer without releasing its memory. Rows returns a Rows instance exposing rows stored in the buffer. The rows returned are a snapshot at the time the method is called. The returned rows and values read from it remain valid until the next call to Reset on the buffer. Schema returns the schema of rows in the buffer. SortingColumns returns the list of columns that rows are expected to be sorted by. The list of sorting columns is configured when the buffer is created and used when it is sorted. Note that unless the buffer is explicitly sorted, there are no guarantees that the rows it contains will be in the order specified by the sorting columns. Swap exchanges the rows at index i and j in the buffer. The method contributes to satisfying sort.Interface. Write writes rows to the buffer, returning the number of rows written. WriteRows writes parquet rows to the buffer, returning the number of rows written. *RowBuffer : RowGroup *RowBuffer : RowWriter *RowBuffer : RowWriterWithSchema *RowBuffer : github.com/polarsignals/frostdb/query/expr.Particulate *RowBuffer : sort.Interface func NewRowBuffer[T](options ...RowGroupOption) *RowBuffer[T]
RowBuilder is a type which helps build parquet rows incrementally by adding values to columns. Add adds columnValue to the column at columnIndex. AppendRow appends the current state of b to row and returns it. Next must be called to indicate the start of a new repeated record for the column at the given index. If the column index is part of a repeated group, the builder automatically starts a new record for all adjacent columns, the application does not need to call this method for each column of the repeated group. Next must be called after adding a sequence of records. Reset clears the internal state of b, making it possible to reuse while retaining the internal buffers. Row materializes the current state of b into a parquet row. func NewRowBuilder(schema Node) *RowBuilder
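A small sketch of incremental row construction with a RowBuilder. The schema is illustrative, and the example assumes Lookup returns a leaf column description carrying a ColumnIndex field, so indexes are not hard-coded:

    // Build a row column by column, then reset the builder for reuse.
    schema := parquet.SchemaOf(struct {
        FirstName string `parquet:"first_name"`
        LastName  string `parquet:"last_name"`
    }{})

    firstName, _ := schema.Lookup("first_name")
    lastName, _ := schema.Lookup("last_name")

    builder := parquet.NewRowBuilder(schema)
    builder.Add(firstName.ColumnIndex, parquet.ByteArrayValue([]byte("Luke")))
    builder.Add(lastName.ColumnIndex, parquet.ByteArrayValue([]byte("Skywalker")))

    row := builder.Row()
    builder.Reset() // the internal buffers are retained for the next row
    fmt.Println(row)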
RowGroup is an interface representing a parquet row group. From the Parquet docs, a RowGroup is "a logical horizontal partitioning of the data into rows. There is no physical structure that is guaranteed for a row group. A row group consists of a column chunk for each column in the dataset." https://github.com/apache/parquet-format#glossary Returns the list of column chunks in this row group. The chunks are ordered in the order of leaf columns from the row group's schema. If the underlying implementation is not read-only, the returned parquet.ColumnChunk may implement other interfaces: for example, parquet.ColumnBuffer if the chunk is backed by an in-memory buffer, or typed writer interfaces like parquet.Int32Writer depending on the underlying type of values that can be written to the chunk. As an optimization, the row group may return the same slice across multiple calls to this method. Applications should treat the returned slice as read-only. Returns the number of rows in the group. Returns a reader exposing the rows of the row group. As an optimization, the returned parquet.Rows object may implement parquet.RowWriterTo, and test the RowWriter it receives for an implementation of the parquet.RowGroupWriter interface. This optimization mechanism is leveraged by the parquet.CopyRows function to skip the generic row-by-row copy algorithm and delegate the copy logic to the parquet.Rows object. Returns the schema of rows in the group. Returns the list of sorting columns describing how rows are sorted in the group. The method will return an empty slice if the rows are not sorted. *Buffer *GenericBuffer[...] *RowBuffer[...] *github.com/polarsignals/frostdb/dynparquet.Buffer github.com/polarsignals/frostdb/dynparquet.DynamicRowGroup (interface) *github.com/polarsignals/frostdb/dynparquet.DynamicRowGroupMergeAdapter github.com/polarsignals/frostdb/dynparquet.MergedRowGroup github.com/polarsignals/frostdb/dynparquet.PooledBuffer github.com/polarsignals/frostdb/index.ReleaseableRowGroup (interface) RowGroup : github.com/polarsignals/frostdb/query/expr.Particulate func ConvertRowGroup(rowGroup RowGroup, conv Conversion) RowGroup func MergeRowGroups(rowGroups []RowGroup, options ...RowGroupOption) (RowGroup, error) func MultiRowGroup(rowGroups ...RowGroup) RowGroup func (*File).RowGroups() []RowGroup func RowGroupReader.ReadRowGroup() (RowGroup, error) func ConvertRowGroup(rowGroup RowGroup, conv Conversion) RowGroup func MergeRowGroups(rowGroups []RowGroup, options ...RowGroupOption) (RowGroup, error) func MultiRowGroup(rowGroups ...RowGroup) RowGroup func NewGenericRowGroupReader[T](rowGroup RowGroup, options ...ReaderOption) *GenericReader[T] func NewRowGroupReader(rowGroup RowGroup, options ...ReaderOption) *Reader func NewRowGroupRowReader(rowGroup RowGroup) Rows func PrintRowGroup(w io.Writer, rowGroup RowGroup) error func (*Buffer).WriteRowGroup(rowGroup RowGroup) (int64, error) func (*GenericBuffer)[T].WriteRowGroup(rowGroup RowGroup) (int64, error) func (*GenericWriter)[T].WriteRowGroup(rowGroup RowGroup) (int64, error) func RowGroupWriter.WriteRowGroup(RowGroup) (int64, error) func (*Writer).WriteRowGroup(rowGroup RowGroup) (int64, error) func github.com/polarsignals/frostdb/dynparquet.NewDynamicRowGroupMergeAdapter(schema *Schema, sortingColumns []SortingColumn, mergedDynamicColumns map[string][]string, originalRowGroup RowGroup) *dynparquet.DynamicRowGroupMergeAdapter func github.com/polarsignals/frostdb/dynparquet.(*Buffer).WriteRowGroup(rg RowGroup) (int64, error) func github.com/polarsignals/frostdb/pqarrow.ParquetRowGroupToArrowSchema(ctx context.Context, rg RowGroup, s *dynparquet.Schema, options logicalplan.IterOptions) (*arrow.Schema, error) func github.com/polarsignals/frostdb/pqarrow.(*ParquetConverter).Convert(ctx context.Context, rg RowGroup, s *dynparquet.Schema) error
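A sketch showing that generic code can work against the RowGroup interface regardless of the concrete implementation; the inspect function name is illustrative:

    // inspect prints the schema, row count, and rows of any RowGroup,
    // for example an in-memory *parquet.Buffer or a row group read from a file.
    func inspect(rg parquet.RowGroup) error {
        fmt.Println("schema:", rg.Schema())
        fmt.Println("rows:", rg.NumRows())

        rows := rg.Rows()
        defer rows.Close()

        buf := make([]parquet.Row, 8)
        for {
            n, err := rows.ReadRows(buf)
            for _, row := range buf[:n] {
                fmt.Println(row)
            }
            if err != nil {
                if err == io.EOF {
                    return nil
                }
                return err
            }
        }
    }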
The RowGroupConfig type carries configuration options for parquet row groups. RowGroupConfig implements the RowGroupOption interface so it can be used directly as argument to the NewBuffer function when needed, for example: buffer := parquet.NewBuffer(&parquet.RowGroupConfig{ ColumnBufferCapacity: 10_000, }) ColumnBufferCapacity int Schema *Schema Sorting SortingConfig (*RowGroupConfig) Apply(options ...RowGroupOption) (*RowGroupConfig) ConfigureRowGroup(config *RowGroupConfig) Validate returns a non-nil error if the configuration of c is invalid. *RowGroupConfig : RowGroupOption func DefaultRowGroupConfig() *RowGroupConfig func NewRowGroupConfig(options ...RowGroupOption) (*RowGroupConfig, error) func (*RowGroupConfig).ConfigureRowGroup(config *RowGroupConfig) func RowGroupOption.ConfigureRowGroup(*RowGroupConfig) func (*Schema).ConfigureRowGroup(config *RowGroupConfig)
RowGroupOption is an interface implemented by types that carry configuration options for parquet row groups. ( RowGroupOption) ConfigureRowGroup(*RowGroupConfig) *RowGroupConfig *Schema func ColumnBufferCapacity(size int) RowGroupOption func SortingRowGroupConfig(options ...SortingOption) RowGroupOption func MergeRowGroups(rowGroups []RowGroup, options ...RowGroupOption) (RowGroup, error) func NewBuffer(options ...RowGroupOption) *Buffer func NewGenericBuffer[T](options ...RowGroupOption) *GenericBuffer[T] func NewRowBuffer[T](options ...RowGroupOption) *RowBuffer[T] func NewRowGroupConfig(options ...RowGroupOption) (*RowGroupConfig, error) func (*RowGroupConfig).Apply(options ...RowGroupOption)
RowGroupReader is an interface implemented by types that expose sequences of row groups to the application. ( RowGroupReader) ReadRowGroup() (RowGroup, error)
RowGroupWriter is an interface implemented by types that allow the program to write row groups. ( RowGroupWriter) WriteRowGroup(RowGroup) (int64, error) *Buffer *GenericBuffer[...] *GenericWriter[...] *Writer *github.com/polarsignals/frostdb/dynparquet.Buffer github.com/polarsignals/frostdb/dynparquet.PooledBuffer
RowReader reads a sequence of parquet rows. ReadRows reads rows from the reader, returning the number of rows read into the buffer, and any error that occurred. Note that the rows read into the buffer are not safe for reuse after a subsequent call to ReadRows. Callers that want to reuse rows must copy the rows using Clone. When all rows have been read, the reader returns io.EOF to indicate the end of the sequence. It is valid for the reader to return both a non-zero number of rows and a non-nil error (including io.EOF). The buffer of rows passed as argument will be used to store values of each row read from the reader. If the rows are not nil, the backing array of the slices will be used as an optimization to avoid re-allocating new arrays. The application is expected to handle the case where ReadRows returns fewer rows than requested and no error, by looking at the first returned value from ReadRows, which is the number of rows that were read. *GenericReader[...] *Reader RowReaderFunc RowReaderWithSchema (interface) RowReadSeeker (interface) Rows (interface) func DedupeRowReader(reader RowReader, compare func(Row, Row) int) RowReader func FilterRowReader(reader RowReader, predicate func(Row) bool) RowReader func MergeRowReaders(readers []RowReader, compare func(Row, Row) int) RowReader func ScanRowReader(reader RowReader, predicate func(Row, int64) bool) RowReader func TransformRowReader(reader RowReader, transform func(dst, src Row) (Row, error)) RowReader func ConvertRowReader(rows RowReader, conv Conversion) RowReaderWithSchema func CopyRows(dst RowWriter, src RowReader) (int64, error) func DedupeRowReader(reader RowReader, compare func(Row, Row) int) RowReader func FilterRowReader(reader RowReader, predicate func(Row) bool) RowReader func MergeRowReaders(readers []RowReader, compare func(Row, Row) int) RowReader func ScanRowReader(reader RowReader, predicate func(Row, int64) bool) RowReader func TransformRowReader(reader RowReader, transform func(dst, src Row) (Row, error)) RowReader func (*GenericWriter)[T].ReadRowsFrom(rows RowReader) (int64, error) func RowReaderFrom.ReadRowsFrom(RowReader) (int64, error) func (*Writer).ReadRowsFrom(rows RowReader) (written int64, err error)
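The ReadRows contract described above (a non-zero row count may accompany io.EOF, and returned rows must be cloned to outlive the next call) typically translates into a loop like this sketch; the reader variable stands for any parquet.RowReader:

    // Canonical ReadRows loop: consume the n rows before checking the error,
    // and clone any row that needs to be retained past the next ReadRows call.
    buf := make([]parquet.Row, 64)
    var retained []parquet.Row

    for {
        n, err := reader.ReadRows(buf)
        for _, row := range buf[:n] {
            retained = append(retained, row.Clone())
        }
        if err != nil {
            if err == io.EOF {
                break
            }
            log.Fatal(err)
        }
    }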
RowReaderFrom reads parquet rows from reader. ( RowReaderFrom) ReadRowsFrom(RowReader) (int64, error) *GenericWriter[...] *Writer
RowReaderFunc is a function type implementing the RowReader interface. ( RowReaderFunc) ReadRows(rows []Row) (int, error) RowReaderFunc : RowReader
RowReaderWithSchema is an extension of the RowReader interface which advertises the schema of rows returned by ReadRows calls. ReadRows reads rows from the reader, returning the number of rows read into the buffer, and any error that occurred. Note that the rows read into the buffer are not safe for reuse after a subsequent call to ReadRows. Callers that want to reuse rows must copy the rows using Clone. When all rows have been read, the reader returns io.EOF to indicate the end of the sequence. It is valid for the reader to return both a non-zero number of rows and a non-nil error (including io.EOF). The buffer of rows passed as argument will be used to store values of each row read from the reader. If the rows are not nil, the backing array of the slices will be used as an optimization to avoid re-allocating new arrays. The application is expected to handle the case where ReadRows returns fewer rows than requested and no error, by looking at the first returned value from ReadRows, which is the number of rows that were read. ( RowReaderWithSchema) Schema() *Schema *GenericReader[...] *Reader Rows (interface) RowReaderWithSchema : RowReader func ConvertRowReader(rows RowReader, conv Conversion) RowReaderWithSchema
RowReadSeeker is an interface implemented by row readers which support seeking to arbitrary row positions. ReadRows reads rows from the reader, returning the number of rows read into the buffer, and any error that occurred. Note that the rows read into the buffer are not safe for reuse after a subsequent call to ReadRows. Callers that want to reuse rows must copy the rows using Clone. When all rows have been read, the reader returns io.EOF to indicate the end of the sequence. It is valid for the reader to return both a non-zero number of rows and a non-nil error (including io.EOF). The buffer of rows passed as argument will be used to store values of each row read from the reader. If the rows are not nil, the backing array of the slices will be used as an optimization to avoid re-allocating new arrays. The application is expected to handle the case where ReadRows returns fewer rows than requested and no error, by looking at the first returned value from ReadRows, which is the number of rows that were read. Positions the stream on the given row index. Some implementations of the interface may only allow seeking forward. The method returns io.ErrClosedPipe if the stream had already been closed. *GenericReader[...] *Reader Rows (interface) RowReadSeeker : RowReader RowReadSeeker : RowSeeker
Rows is an interface implemented by row readers returned by calling the Rows method of RowGroup instances. Applications should call Close when they are done using a Rows instance in order to release the underlying resources held by the row sequence. After calling Close, all attempts to read more rows will return io.EOF. ( Rows) Close() error ReadRows reads rows from the reader, returning the number of rows read into the buffer, and any error that occurred. Note that the rows read into the buffer are not safe for reuse after a subsequent call to ReadRows. Callers that want to reuse rows must copy the rows using Clone. When all rows have been read, the reader returns io.EOF to indicate the end of the sequence. It is valid for the reader to return both a non-zero number of rows and a non-nil error (including io.EOF). The buffer of rows passed as argument will be used to store values of each row read from the reader. If the rows are not nil, the backing array of the slices will be used as an optimization to avoid re-allocating new arrays. The application is expected to handle the case where ReadRows returns fewer rows than requested and no error, by looking at the first returned value from ReadRows, which is the number of rows that were read. ( Rows) Schema() *Schema Positions the stream on the given row index. Some implementations of the interface may only allow seeking forward. The method returns io.ErrClosedPipe if the stream had already been closed. *GenericReader[...] *Reader Rows : RowReader Rows : RowReaderWithSchema Rows : RowReadSeeker Rows : RowSeeker Rows : github.com/prometheus/common/expfmt.Closer Rows : io.Closer func NewRowGroupRowReader(rowGroup RowGroup) Rows func (*Buffer).Rows() Rows func (*GenericBuffer)[T].Rows() Rows func (*RowBuffer)[T].Rows() Rows func RowGroup.Rows() Rows func github.com/polarsignals/frostdb/dynparquet.(*Buffer).Rows() Rows func github.com/polarsignals/frostdb/dynparquet.DynamicRowGroup.Rows() Rows func github.com/polarsignals/frostdb/dynparquet.(*DynamicRowGroupMergeAdapter).Rows() Rows func github.com/polarsignals/frostdb/index.ReleaseableRowGroup.Rows() Rows
RowSeeker is an interface implemented by readers of parquet rows which can be positioned at a specific row index. Positions the stream on the given row index. Some implementations of the interface may only allow seeking forward. The method returns io.ErrClosedPipe if the stream had already been closed. *GenericReader[...] Pages (interface) *Reader RowReadSeeker (interface) Rows (interface) github.com/polarsignals/frostdb/dynparquet.DynamicRowReader (interface)
RowWriter writes parquet rows to an underlying medium. Writes rows to the writer, returning the number of rows written and any error that occurred. Because columnar operations operate on independent columns of values, writes of rows may not be atomic operations, and could result in some rows being partially written. The method returns the number of rows that were successfully written, but if an error occurs, values of the row(s) that failed to be written may have been partially committed to their columns. For that reason, applications should consider a write error as fatal and assume that they need to discard the state; they cannot retry the write nor recover the underlying file. *Buffer *GenericBuffer[...] *GenericWriter[...] *RowBuffer[...] RowWriterFunc RowWriterWithSchema (interface) *SortingWriter[...] *Writer github.com/polarsignals/frostdb.ParquetWriter (interface) *github.com/polarsignals/frostdb/dynparquet.Buffer github.com/polarsignals/frostdb/dynparquet.ParquetWriter (interface) github.com/polarsignals/frostdb/dynparquet.PooledBuffer github.com/polarsignals/frostdb/dynparquet.PooledWriter func DedupeRowWriter(writer RowWriter, compare func(Row, Row) int) RowWriter func FilterRowWriter(writer RowWriter, predicate func(Row) bool) RowWriter func MultiRowWriter(writers ...RowWriter) RowWriter func TransformRowWriter(writer RowWriter, transform func(dst, src Row) (Row, error)) RowWriter func CopyRows(dst RowWriter, src RowReader) (int64, error) func DedupeRowWriter(writer RowWriter, compare func(Row, Row) int) RowWriter func FilterRowWriter(writer RowWriter, predicate func(Row) bool) RowWriter func MultiRowWriter(writers ...RowWriter) RowWriter func TransformRowWriter(writer RowWriter, transform func(dst, src Row) (Row, error)) RowWriter func RowWriterTo.WriteRowsTo(RowWriter) (int64, error)
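The wrapper constructors listed above compose. A sketch that fans rows out to two destinations while filtering; dstA, dstB (any RowWriter values, e.g. *parquet.Writer) and src (any RowReader) are illustrative:

    // Keep only rows whose first value is non-null, and duplicate every kept
    // row to both destination writers.
    keep := func(row parquet.Row) bool {
        return len(row) > 0 && !row[0].IsNull()
    }
    w := parquet.FilterRowWriter(parquet.MultiRowWriter(dstA, dstB), keep)

    if _, err := parquet.CopyRows(w, src); err != nil {
        log.Fatal(err)
    }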
RowWriterFunc is a function type implementing the RowWriter interface. ( RowWriterFunc) WriteRows(rows []Row) (int, error) RowWriterFunc : RowWriter
RowWriterTo writes parquet rows to a writer. ( RowWriterTo) WriteRowsTo(RowWriter) (int64, error)
RowWriterWithSchema is an extension of the RowWriter interface which advertises the schema of rows expected to be passed to WriteRows calls. ( RowWriterWithSchema) Schema() *Schema Writes rows to the writer, returning the number of rows written and any error that occurred. Because columnar operations operate on independent columns of values, writes of rows may not be atomic operations, and could result in some rows being partially written. The method returns the number of rows that were successfully written, but if an error occurs, values of the row(s) that failed to be written may have been partially committed to their columns. For that reason, applications should consider a write error as fatal and assume that they need to discard the state; they cannot retry the write nor recover the underlying file. *Buffer *GenericBuffer[...] *GenericWriter[...] *RowBuffer[...] *SortingWriter[...] *Writer *github.com/polarsignals/frostdb/dynparquet.Buffer github.com/polarsignals/frostdb/dynparquet.ParquetWriter (interface) github.com/polarsignals/frostdb/dynparquet.PooledBuffer github.com/polarsignals/frostdb/dynparquet.PooledWriter RowWriterWithSchema : RowWriter
Schema represents a parquet schema created from a Go value. Schema implements the Node interface to represent the root node of a parquet schema. Columns returns the list of column paths available in the schema. The method always returns the same slice value across calls to Columns; applications should treat it as immutable. Comparator constructs a comparator function which orders rows according to the list of sorting columns passed as arguments. Compression returns the compression codec set on the root node of the parquet schema. ConfigureReader satisfies the ReaderOption interface, allowing Schema instances to be passed to NewReader to pre-declare the schema of rows read from the reader. ConfigureRowGroup satisfies the RowGroupOption interface, allowing Schema instances to be passed to row group constructors to pre-declare the schema of the output parquet file. ConfigureWriter satisfies the WriterOption interface, allowing Schema instances to be passed to NewWriter to pre-declare the schema of the output parquet file. Deconstruct deconstructs a Go value and appends it to a row. The method panics if the structure of the Go value does not match the parquet schema. Encoding returns the encoding set on the root node of the parquet schema. Fields returns the list of fields on the root node of the parquet schema. GoType returns the Go type that best represents the schema. ID returns the field id of the root node. Leaf returns true if the root node of the parquet schema is a leaf column. Lookup returns the leaf column at the given path. The path is the sequence of column names identifying a leaf column (not including the root). If the path was not found in the mapping, or if it did not represent a leaf column of the parquet schema, the boolean will be false. Name returns the name of s. Optional returns false since the root node of a parquet schema is always required. Reconstruct reconstructs a Go value from a row. The Go value passed as first argument must be a non-nil pointer for the row to be decoded into. The method panics if the structure of the Go value and parquet row do not match. Repeated returns false since the root node of a parquet schema is always required. Required returns true since the root node of a parquet schema is always required. String returns a parquet schema representation of s. Type returns the parquet type of s.
*Schema : Node *Schema : ReaderOption *Schema : RowGroupOption *Schema : WriterOption *Schema : github.com/polarsignals/frostdb/query/logicalplan.Named *Schema : expvar.Var *Schema : fmt.Stringer func NewSchema(name string, root Node) *Schema func SchemaOf(model interface{}) *Schema func (*Buffer).Schema() *Schema func Conversion.Schema() *Schema func (*File).Schema() *Schema func (*GenericBuffer)[T].Schema() *Schema func (*GenericReader)[T].Schema() *Schema func (*GenericWriter)[T].Schema() *Schema func (*Reader).Schema() *Schema func (*RowBuffer)[T].Schema() *Schema func RowGroup.Schema() *Schema func RowReaderWithSchema.Schema() *Schema func Rows.Schema() *Schema func RowWriterWithSchema.Schema() *Schema func (*SortingWriter)[T].Schema() *Schema func (*Writer).Schema() *Schema func github.com/polarsignals/frostdb/dynparquet.ParquetSchemaFromV2Definition(def *schemav2pb.Schema) *Schema func github.com/polarsignals/frostdb/dynparquet.(*Buffer).Schema() *Schema func github.com/polarsignals/frostdb/dynparquet.DynamicRowGroup.Schema() *Schema func github.com/polarsignals/frostdb/dynparquet.(*DynamicRowGroupMergeAdapter).Schema() *Schema func github.com/polarsignals/frostdb/dynparquet.ParquetWriter.Schema() *Schema func github.com/polarsignals/frostdb/dynparquet.(*Schema).ParquetSchema() *Schema func github.com/polarsignals/frostdb/index.ReleaseableRowGroup.Schema() *Schema func github.com/polarsignals/frostdb/query/expr.Particulate.Schema() *Schema func FileSchema(schema *Schema) FileOption func github.com/polarsignals/frostdb/dynparquet.NewDynamicRow(row Row, schema *Schema, dyncols map[string][]string, fields []Field) *dynparquet.DynamicRow func github.com/polarsignals/frostdb/dynparquet.NewDynamicRowGroupMergeAdapter(schema *Schema, sortingColumns []SortingColumn, mergedDynamicColumns map[string][]string, originalRowGroup RowGroup) *dynparquet.DynamicRowGroupMergeAdapter func github.com/polarsignals/frostdb/dynparquet.NewDynamicRows(rows []Row, schema *Schema, dynamicColumns map[string][]string, fields []Field) *dynparquet.DynamicRows func github.com/polarsignals/frostdb/pqarrow.ParquetSchemaToArrowSchema(ctx context.Context, schema *Schema, s *dynparquet.Schema, options logicalplan.IterOptions) (*arrow.Schema, error) func github.com/polarsignals/frostdb/pqarrow.RecordToDynamicRow(pqSchema *Schema, record arrow.Record, dyncols map[string][]string, index int) (*dynparquet.DynamicRow, error) func github.com/polarsignals/frostdb/pqarrow.RecordToRow(final *Schema, record arrow.Record, index int) (Row, error)
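A small sketch of the Deconstruct/Reconstruct round trip between Go values and parquet rows; the Event struct is illustrative:

    // Deconstruct a Go value into a parquet.Row, then rebuild it.
    type Event struct {
        Name string `parquet:"name"`
        Code int64  `parquet:"code"`
    }

    schema := parquet.SchemaOf(Event{})

    row := schema.Deconstruct(nil, Event{Name: "started", Code: 1})

    var decoded Event
    if err := schema.Reconstruct(&decoded, row); err != nil {
        log.Fatal(err)
    }
    fmt.Println(decoded) // {started 1}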
SortingColumn represents a column by which a row group is sorted. Returns true if the column will sort values in descending order. Returns true if the column will put null values at the beginning. Returns the path of the column in the row group schema, omitting the name of the root node. github.com/polarsignals/frostdb/dynparquet.SortingColumn (interface) func Ascending(path ...string) SortingColumn func Descending(path ...string) SortingColumn func NullsFirst(sortingColumn SortingColumn) SortingColumn func (*Buffer).SortingColumns() []SortingColumn func (*GenericBuffer)[T].SortingColumns() []SortingColumn func (*RowBuffer)[T].SortingColumns() []SortingColumn func RowGroup.SortingColumns() []SortingColumn func github.com/polarsignals/frostdb/dynparquet.SortingColumnsFromDef(def *schemav2pb.Schema) ([]SortingColumn, error) func github.com/polarsignals/frostdb/dynparquet.(*Buffer).SortingColumns() []SortingColumn func github.com/polarsignals/frostdb/dynparquet.DynamicRowGroup.SortingColumns() []SortingColumn func github.com/polarsignals/frostdb/dynparquet.(*DynamicRowGroupMergeAdapter).SortingColumns() []SortingColumn func github.com/polarsignals/frostdb/dynparquet.Schema.ParquetSortingColumns(dynamicColumns map[string][]string) []SortingColumn func github.com/polarsignals/frostdb/index.ReleaseableRowGroup.SortingColumns() []SortingColumn func NullsFirst(sortingColumn SortingColumn) SortingColumn func SortingColumns(columns ...SortingColumn) SortingOption func (*Schema).Comparator(sortingColumns ...SortingColumn) func(Row, Row) int func github.com/polarsignals/frostdb/dynparquet.NewDynamicRowGroupMergeAdapter(schema *Schema, sortingColumns []SortingColumn, mergedDynamicColumns map[string][]string, originalRowGroup RowGroup) *dynparquet.DynamicRowGroupMergeAdapter
The SortingConfig type carries configuration options for sorting parquet rows. SortingConfig implements the SortingOption interface so it can be used directly as argument to the NewSortingConfig function when needed; sorting options can also be grouped into a writer option with SortingWriterConfig, for example: writer := parquet.NewSortingWriter[Row](output, 10_000, parquet.SortingWriterConfig( parquet.DropDuplicatedRows(true), ), ) DropDuplicatedRows bool SortingBuffers BufferPool SortingColumns []SortingColumn (*SortingConfig) Apply(options ...SortingOption) (*SortingConfig) ConfigureSorting(config *SortingConfig) (*SortingConfig) Validate() error *SortingConfig : SortingOption func DefaultSortingConfig() *SortingConfig func NewSortingConfig(options ...SortingOption) (*SortingConfig, error) func (*SortingConfig).ConfigureSorting(config *SortingConfig) func SortingOption.ConfigureSorting(*SortingConfig)
SortingOption is an interface implemented by types that carry configuration options for parquet sorting writers. ( SortingOption) ConfigureSorting(*SortingConfig) *SortingConfig func DropDuplicatedRows(drop bool) SortingOption func SortingBuffers(buffers BufferPool) SortingOption func SortingColumns(columns ...SortingColumn) SortingOption func NewSortingConfig(options ...SortingOption) (*SortingConfig, error) func SortingRowGroupConfig(options ...SortingOption) RowGroupOption func SortingWriterConfig(options ...SortingOption) WriterOption func (*SortingConfig).Apply(options ...SortingOption)
Type Parameters: T: any SortingWriter is a type similar to GenericWriter but it ensures that rows are sorted according to the sorting columns configured on the writer. The writer accumulates rows in an in-memory buffer which is sorted when it reaches the target number of rows, then written to a temporary row group. When the writer is flushed or closed, the temporary row groups are merged into a row group in the output file, ensuring that rows remain sorted in the final row group. Because row groups get encoded and compressed, they use much less memory than if all rows were retained in memory. Sorting then merging chunks of rows also tends to be more efficient than sorting all rows in memory, as it results in better CPU cache utilization; sorting multi-megabyte arrays causes many cache misses because the data set cannot fit in CPU caches. (*SortingWriter[T]) Close() error (*SortingWriter[T]) Flush() error (*SortingWriter[T]) Reset(output io.Writer) (*SortingWriter[T]) Schema() *Schema (*SortingWriter[T]) SetKeyValueMetadata(key, value string) (*SortingWriter[T]) Write(rows []T) (int, error) (*SortingWriter[T]) WriteRows(rows []Row) (int, error) *SortingWriter : RowWriter *SortingWriter : RowWriterWithSchema *SortingWriter : github.com/apache/thrift/lib/go/thrift.Flusher *SortingWriter : github.com/polarsignals/frostdb.ParquetWriter *SortingWriter : github.com/prometheus/common/expfmt.Closer *SortingWriter : io.Closer func NewSortingWriter[T](output io.Writer, sortRowCount int64, options ...WriterOption) *SortingWriter[T]
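A sketch of constructing a sorting writer with the constructor shown above. The Record type, the output writer, the records slice, and the 64*1024 chunk size are illustrative assumptions:

    // A sorting writer that keeps rows ordered by (tenant, timestamp) and
    // removes duplicates before producing the output file.
    type Record struct {
        Tenant    string `parquet:"tenant"`
        Timestamp int64  `parquet:"timestamp"`
    }

    writer := parquet.NewSortingWriter[Record](output, 64*1024,
        parquet.SortingWriterConfig(
            parquet.SortingColumns(
                parquet.Ascending("tenant"),
                parquet.Ascending("timestamp"),
            ),
            parquet.DropDuplicatedRows(true),
        ),
    )

    if _, err := writer.Write(records); err != nil { // records is a []Record
        log.Fatal(err)
    }
    if err := writer.Close(); err != nil {
        log.Fatal(err)
    }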
TimeUnit represents units of time in the parquet type system. Returns the precision of the time unit as a time.Duration value. Converts the TimeUnit value to its representation in the parquet thrift format. func Time(unit TimeUnit) Node func Timestamp(unit TimeUnit) Node var Microsecond var Millisecond var Nanosecond
The Type interface represents logical types of the parquet type system. Types are immutable and therefore safe to access from multiple goroutines. Assigns a Parquet value to a Go value. Returns an error if assignment is not possible. The source Value must be an expected logical type for the receiver. This can be accomplished using ConvertValue. ColumnOrder returns the type's column order. For group types, this method returns nil. The order describes the comparison logic implemented by the Less method. As an optimization, the method may return the same pointer across multiple calls. Applications must treat the returned value as immutable; mutating the value will result in undefined behavior. Compares two values and returns a negative integer if a < b, positive if a > b, or zero if a == b. The values' Kind must match the type, otherwise the result is undefined. The method panics if it is called on a group type. Convert a Parquet Value of the given Type into a Parquet Value that is compatible with the receiver. The returned Value is suitable to be passed to AssignValue. Returns the logical type's equivalent converted type. When there is no equivalent converted type, the method returns nil. As an optimization, the method may return the same pointer across multiple calls. Applications must treat the returned value as immutable; mutating the value will result in undefined behavior. Assuming the src buffer contains values encoded in the given encoding, decodes the input and produces the decoded values into the dst output buffer passed as first argument by dispatching the call to one of the encoding methods. Assuming the src buffer contains PLAIN encoded values of the type it is called on, applies the given encoding and produces the output to the dst buffer passed as first argument by dispatching the call to one of the encoding methods. Returns an estimation of the output size after decoding the values passed as first argument with the given encoding. For most types, this is similar to calling EstimateSize with the known number of encoded values. For variable size types, using this method may provide a more precise result since it can inspect the input buffer. Returns an estimation of the number of values of this type that can be held in the given byte size. The method returns zero for group types. Returns an estimation of the number of bytes required to hold the given number of values of this type in memory. The method returns zero for group types. Returns the Kind value representing the underlying physical type. The method panics if it is called on a group type. For integer and floating point physical types, the method returns the size of values in bits. For fixed-length byte arrays, the method returns the size of elements in bytes. For other types, the value is zero. Returns the logical type as a *format.LogicalType value. When the logical type is unknown, the method returns nil. As an optimization, the method may return the same pointer across multiple calls. Applications must treat the returned value as immutable; mutating the value will result in undefined behavior. Creates a row group buffer column for values of this type. Column buffers are created using the index of the column they are accumulating values in memory for (relative to the parent schema), and the size of their memory buffer. The application may give an estimate of the number of values it expects to write to the buffer as second argument.
This estimate helps set the initial buffer capacity but is not a hard limit; the underlying memory buffer will grow as needed to allow more values to be written. Programs may use the Size method of the column buffer (or the parent row group, when relevant) to determine how many bytes are being used, and perform a flush of the buffers to a storage layer. The method panics if it is called on a group type. Creates a column indexer for values of this type. The size limit is a hint to the column indexer that it is allowed to truncate the page boundaries to the given size. Only BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY types currently take this value into account. A value of zero or less means no limits. The method panics if it is called on a group type. Creates a dictionary holding values of this type. The dictionary retains the data buffer; it does not make a copy of it. If the application needs to share ownership of the memory buffer, it must ensure that it will not be modified while the page is in use, or it must make a copy of it prior to creating the dictionary. The method panics if the data type does not correspond to the parquet type it is called on. Creates a page belonging to a column at the given index, backed by the data buffer. The page retains the data buffer; it does not make a copy of it. If the application needs to share ownership of the memory buffer, it must ensure that it will not be modified while the page is in use, or it must make a copy of it prior to creating the page. The method panics if the data type does not correspond to the parquet type it is called on. Creates an encoding.Values instance backed by the given buffers. The offsets buffer is only used by BYTE_ARRAY types, where it represents the positions of each variable length value in the values buffer. The following expression creates an empty instance for any type: values := typ.NewValues(nil, nil) The method panics if it is called on group types. Returns the physical type as a *format.Type value. For group types, this method returns nil. As an optimization, the method may return the same pointer across multiple calls. Applications must treat the returned value as immutable; mutating the value will result in undefined behavior. Returns a human-readable representation of the parquet type. Type : expvar.Var Type : fmt.Stringer func FixedLenByteArrayType(length int) Type func (*Column).Type() Type func ColumnBuffer.Type() Type func ColumnChunk.Type() Type func Dictionary.Type() Type func Field.Type() Type func Group.Type() Type func Node.Type() Type func Page.Type() Type func (*Schema).Type() Type func github.com/polarsignals/frostdb/dynparquet.(*NilColumnChunk).Type() Type func Decimal(scale, precision int, typ Type) Node func Leaf(typ Type) Node func Search(index ColumnIndex, value Value, typ Type) int func Type.ConvertValue(val Value, typ Type) (Value, error) func github.com/polarsignals/frostdb/dynparquet.NewNilColumnChunk(typ Type, columnIndex, numValues int) *dynparquet.NilColumnChunk var BooleanType var ByteArrayType var DoubleType var FloatType var Int32Type var Int64Type var Int96Type
The Value type is similar to the reflect.Value abstraction of Go values, but for parquet values. Value instances wrap underlying Go values mapped to one of the parquet physical types. Value instances are small, immutable objects, and usually passed by value between function calls. The zero-value of Value represents the null parquet value. AppendBytes appends the binary representation of v to b. If v is the null value, b is returned unchanged. Boolean returns v as a bool, assuming the underlying type is BOOLEAN. Byte returns v as a byte, which may truncate the underlying value. ByteArray returns v as a []byte, assuming the underlying type is either BYTE_ARRAY or FIXED_LEN_BYTE_ARRAY. The application must treat the returned byte slice as a read-only value; mutating the content will result in undefined behavior. Bytes returns the binary representation of v. If v is the null value, a nil byte slice is returned. Clone returns a copy of v which does not share any pointers with it. Column returns the column index within the row that v was created from. Returns -1 if the value does not carry a column index. DefinitionLevel returns the definition level of v. Double returns v as a float64, assuming the underlying type is DOUBLE. Float returns v as a float32, assuming the underlying type is FLOAT. Format outputs a human-readable representation of v to w, using r as the formatting verb to describe how the value should be printed. The following formatting options are supported: %c prints the column index %+c prints the column index, prefixed with "C:" %d prints the definition level %+d prints the definition level, prefixed with "D:" %r prints the repetition level %+r prints the repetition level, prefixed with "R:" %q prints the quoted representation of v %+q prints the quoted representation of v, prefixed with "V:" %s prints the string representation of v %+s prints the string representation of v, prefixed with "V:" %v same as %s %+v prints a verbose representation of v %#v prints a Go value representation of v Format satisfies the fmt.Formatter interface. GoString returns a Go value string representation of v. Int32 returns v as an int32, assuming the underlying type is INT32. Int64 returns v as an int64, assuming the underlying type is INT64. Int96 returns v as an int96, assuming the underlying type is INT96. IsNull returns true if v is the null value. Kind returns the kind of v, which represents its parquet physical type. Level returns v with the repetition level, definition level, and column index set to the values passed as arguments. The method panics if either argument is negative. RepetitionLevel returns the repetition level of v. String returns a string representation of v. Uint32 returns v as a uint32, assuming the underlying type is INT32. Uint64 returns v as a uint64, assuming the underlying type is INT64.
Value : github.com/apache/arrow-go/v18/internal/hashing.ByteSlice Value : expvar.Var Value : fmt.Formatter Value : fmt.GoStringer Value : fmt.Stringer Value : math/rand/v2.Source func BooleanValue(value bool) Value func ByteArrayValue(value []byte) Value func DoubleValue(value float64) Value func FixedLenByteArrayValue(value []byte) Value func FloatValue(value float32) Value func Int32Value(value int32) Value func Int64Value(value int64) Value func Int96Value(value deprecated.Int96) Value func NullValue() Value func ValueOf(v interface{}) Value func ZeroValue(kind Kind) Value func ColumnIndex.MaxValue(int) Value func ColumnIndex.MinValue(int) Value func Dictionary.Bounds(indexes []int32) (min, max Value) func Dictionary.Index(index int32) Value func Kind.Value(v []byte) Value func Page.Bounds() (min, max Value, ok bool) func Type.ConvertValue(val Value, typ Type) (Value, error) func Value.Clone() Value func Value.Level(repetitionLevel, definitionLevel, columnIndex int) Value func github.com/polarsignals/frostdb/dynparquet.ValuesForIndex(row Row, index int) []Value func github.com/polarsignals/frostdb/pqarrow.ArrowScalarToParquetValue(sc scalar.Scalar) (Value, error) func github.com/polarsignals/frostdb/query/expr.Max(columnIndex ColumnIndex) Value func github.com/polarsignals/frostdb/query/expr.Min(columnIndex ColumnIndex) Value func AppendRow(row Row, columns ...[]Value) Row func DeepEqual(v1, v2 Value) bool func Equal(v1, v2 Value) bool func Find(index ColumnIndex, value Value, cmp func(Value, Value) int) int func MakeRow(columns ...[]Value) Row func Search(index ColumnIndex, value Value, typ Type) int func BloomFilter.Check(value Value) (bool, error) func ColumnBuffer.ReadValuesAt([]Value, int64) (int, error) func ColumnBuffer.WriteValues([]Value) (int, error) func ColumnIndexer.IndexPage(numValues, numNulls int64, min, max Value) func Dictionary.Insert(indexes []int32, values []Value) func Dictionary.Lookup(indexes []int32, values []Value) func (*RowBuilder).Add(columnIndex int, columnValue Value) func Type.AssignValue(dst reflect.Value, src Value) error func Type.Compare(a, b Value) int func Type.ConvertValue(val Value, typ Type) (Value, error) func ValueReader.ReadValues([]Value) (int, error) func ValueReaderAt.ReadValuesAt([]Value, int64) (int, error) func ValueReaderFunc.ReadValues(values []Value) (int, error) func ValueWriter.WriteValues([]Value) (int, error) func ValueWriterFunc.WriteValues(values []Value) (int, error) func github.com/polarsignals/frostdb/pqarrow/builder.(*OptBinaryBuilder).AppendParquetValues(values []Value) error func github.com/polarsignals/frostdb/pqarrow/builder.(*OptBooleanBuilder).AppendParquetValues(values []Value) func github.com/polarsignals/frostdb/pqarrow/builder.(*OptFloat64Builder).AppendParquetValues(values []Value) func github.com/polarsignals/frostdb/pqarrow/builder.(*OptInt32Builder).AppendParquetValues(values []Value) func github.com/polarsignals/frostdb/pqarrow/builder.(*OptInt64Builder).AppendParquetValues(values []Value) func github.com/polarsignals/frostdb/pqarrow/writer.PageWriter.Write([]Value) func github.com/polarsignals/frostdb/pqarrow/writer.ValueWriter.Write([]Value) func github.com/polarsignals/frostdb/query/expr.BinaryScalarOperation(left ColumnChunk, right Value, operator logicalplan.Op) (bool, error)
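A short sketch exercising a few of the constructors and accessors listed above; the concrete values are illustrative:

    // Construct and inspect parquet values.
    v := parquet.Int64Value(42)
    fmt.Println(v.Int64(), v.IsNull()) // 42 false

    s := parquet.ByteArrayValue([]byte("hello"))
    fmt.Println(string(s.ByteArray())) // hello

    null := parquet.NullValue()
    fmt.Println(null.IsNull()) // true

    // ValueOf maps a Go value to the corresponding parquet physical type.
    fmt.Println(parquet.ValueOf(3.5).Double()) // 3.5

    // Level returns a copy annotated with levels and a column index.
    annotated := v.Level(0, 1, 2)
    fmt.Println(annotated.RepetitionLevel(), annotated.DefinitionLevel(), annotated.Column()) // 0 1 2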
ValueReader is an interface implemented by types that support reading batches of values. Read values into the buffer passed as argument and return the number of values read. When all values have been read, the error will be io.EOF. ValueReaderFunc func Page.Values() ValueReader func CopyValues(dst ValueWriter, src ValueReader) (int64, error) func ValueReaderFrom.ReadValuesFrom(ValueReader) (int64, error)
ValueReaderAt is an interface implemented by types that support reading values at offsets specified by the application. ( ValueReaderAt) ReadValuesAt([]Value, int64) (int, error) ColumnBuffer (interface)
ValueReaderFrom is an interface implemented by value writers to read values from a reader. ( ValueReaderFrom) ReadValuesFrom(ValueReader) (int64, error)
ValueReaderFunc is a function type implementing the ValueReader interface. ( ValueReaderFunc) ReadValues(values []Value) (int, error) ValueReaderFunc : ValueReader
ValueWriter is an interface implemented by types that support writing batches of values. Write values from the buffer passed as argument and return the number of values written. ColumnBuffer (interface) ValueWriterFunc func CopyValues(dst ValueWriter, src ValueReader) (int64, error) func ValueWriterTo.WriteValuesTo(ValueWriter) (int64, error)
ValueWriterFunc is a function type implementing the ValueWriter interface. ( ValueWriterFunc) WriteValues(values []Value) (int, error) ValueWriterFunc : ValueWriter
ValueWriterTo is an interface implemented by value readers to write values to a writer. ( ValueWriterTo) WriteValuesTo(ValueWriter) (int64, error)
Deprecated: A Writer uses a parquet schema and a sequence of Go values to produce a parquet file to an io.Writer. This example showcases a typical use of parquet writers: writer := parquet.NewWriter(output) for _, row := range rows { if err := writer.Write(row); err != nil { ... } } if err := writer.Close(); err != nil { ... } The Writer type optimizes for minimal memory usage: each page is written as soon as it has been filled, so only a single page per column needs to be held in memory; as a result, there are no opportunities to sort rows within an entire row group. Programs that need to produce parquet files with sorted row groups should use the Buffer type to buffer and sort the rows prior to writing them to a Writer. For programs building with Go 1.18 or later, the GenericWriter[T] type supersedes this one. Close must be called after all values were produced to the writer in order to flush all buffers and write the parquet footer. Flush flushes all buffers into a row group to the underlying io.Writer. Flush is called automatically on Close; it is only useful to call explicitly if the application needs to limit the size of row groups or wants to produce multiple row groups per file. If the writer attempts to create more than MaxRowGroups row groups, the method returns ErrTooManyRowGroups. ReadRowsFrom reads rows from the reader passed as argument and writes them to w. This is similar to calling WriteRows repeatedly, but will be more efficient if optimizations are supported by the reader. Reset clears the state of the writer without flushing any of the buffers, and sets the output to the io.Writer passed as argument, allowing the writer to be reused to produce another parquet file. Reset may be called at any time, including after a writer was closed. Schema returns the schema of rows written by w. The returned value will be nil if no schema has yet been configured on w. SetKeyValueMetadata sets a key/value pair in the Parquet file metadata. Keys are assumed to be unique; if the same key is repeated multiple times the last value is retained. While the parquet format does not require unique keys, this design decision was made to optimize for the most common use case where applications leverage this extension mechanism to associate single values to keys. This may create incompatibilities with other parquet libraries, or may cause some key/value pairs to be lost when opening parquet files written with repeated keys. We can revisit this decision if it ever becomes a blocker. Write is called to write another row to the parquet file. The method uses the parquet schema configured on w to traverse the Go value and decompose it into a set of columns and values. If no schema was passed to NewWriter, it is deduced from the Go type of the row, which then has to be a struct or a pointer to a struct. WriteRowGroup writes a row group to the parquet file. Buffered rows will be flushed prior to writing rows from the group, unless the row group was empty in which case nothing is written to the file. The content of the row group is flushed to the writer; after the method returns successfully, the row group will be empty and ready to be reused. WriteRows is called to write rows to the parquet file. The Writer must have been given a schema when NewWriter was called, otherwise the structure of the parquet file cannot be determined from the rows alone. Each row is expected to contain values for each column of the writer's schema, in the order produced by the parquet.(*Schema).Deconstruct method.
*Writer : RowGroupWriter *Writer : RowReaderFrom *Writer : RowWriter *Writer : RowWriterWithSchema *Writer : github.com/apache/thrift/lib/go/thrift.Flusher *Writer : github.com/polarsignals/frostdb.ParquetWriter *Writer : github.com/prometheus/common/expfmt.Closer *Writer : io.Closer func NewWriter(output io.Writer, options ...WriterOption) *Writer
The WriterConfig type carries configuration options for parquet writers. WriterConfig implements the WriterOption interface so it can be used directly as argument to the NewWriter function when needed, for example: writer := parquet.NewWriter(output, schema, &parquet.WriterConfig{ CreatedBy: "my test program", }) BloomFilters []BloomFilterColumn ColumnIndexSizeLimit int ColumnPageBuffers BufferPool Compression compress.Codec CreatedBy string DataPageStatistics bool DataPageVersion int KeyValueMetadata map[string]string MaxRowsPerRowGroup int64 PageBufferSize int Schema *Schema SkipPageBounds [][]string Sorting SortingConfig WriteBufferSize int Apply applies the given list of options to c. ConfigureWriter applies configuration options from c to config. Validate returns a non-nil error if the configuration of c is invalid. *WriterConfig : WriterOption func DefaultWriterConfig() *WriterConfig func NewWriterConfig(options ...WriterOption) (*WriterConfig, error) func (*Schema).ConfigureWriter(config *WriterConfig) func (*WriterConfig).ConfigureWriter(config *WriterConfig) func WriterOption.ConfigureWriter(*WriterConfig)
WriterOption is an interface implemented by types that carry configuration options for parquet writers. ( WriterOption) ConfigureWriter(*WriterConfig) *Schema *WriterConfig func BloomFilters(filters ...BloomFilterColumn) WriterOption func ColumnIndexSizeLimit(sizeLimit int) WriterOption func ColumnPageBuffers(buffers BufferPool) WriterOption func Compression(codec compress.Codec) WriterOption func CreatedBy(application, version, build string) WriterOption func DataPageStatistics(enabled bool) WriterOption func DataPageVersion(version int) WriterOption func KeyValueMetadata(key, value string) WriterOption func MaxRowsPerRowGroup(numRows int64) WriterOption func PageBufferSize(size int) WriterOption func SkipPageBounds(path ...string) WriterOption func SortingWriterConfig(options ...SortingOption) WriterOption func WriteBufferSize(size int) WriterOption func NewGenericWriter[T](output io.Writer, options ...WriterOption) *GenericWriter[T] func NewSortingWriter[T](output io.Writer, sortRowCount int64, options ...WriterOption) *SortingWriter[T] func NewWriter(output io.Writer, options ...WriterOption) *Writer func NewWriterConfig(options ...WriterOption) (*WriterConfig, error) func Write[T](w io.Writer, rows []T, options ...WriterOption) error func WriteFile[T](path string, rows []T, options ...WriterOption) error func (*WriterConfig).Apply(options ...WriterOption) func github.com/polarsignals/frostdb/dynparquet.(*Schema).NewWriter(w io.Writer, dynamicColumns map[string][]string, sorting bool, options ...WriterOption) (dynparquet.ParquetWriter, error)
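A sketch of combining several of the options listed above when constructing a generic writer; the LogLine type, the output writer, the column path, and the numeric limits are illustrative assumptions:

    // Combine writer options: bound row group size, attach metadata, and
    // generate a bloom filter for the "id" column.
    type LogLine struct {
        ID      string `parquet:"id"`
        Message string `parquet:"message"`
    }

    writer := parquet.NewGenericWriter[LogLine](output,
        parquet.MaxRowsPerRowGroup(1_000_000),
        parquet.KeyValueMetadata("source", "example"),
        parquet.BloomFilters(parquet.SplitBlockFilter(10, "id")),
    )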
Package-Level Functions (total 128)
AppendRow appends to row the given list of column values. AppendRow can be used to construct a Row value from columns while retaining the underlying memory buffer to avoid reallocation, as shown in the sketch below. The function panics if the column indexes of values in each column do not match their position in the argument list.
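A minimal sketch of reusing a row's backing array across iterations. Tagging each value with its column index through Level is an assumption made here to satisfy the index-matching requirement stated above; the values themselves are illustrative:

    // Rebuild the row on each iteration while reusing its backing array.
    var row parquet.Row
    for i := int64(0); i < 3; i++ {
        row = parquet.AppendRow(row[:0],
            []parquet.Value{parquet.Int64Value(i).Level(0, 0, 0)},                 // column 0
            []parquet.Value{parquet.ByteArrayValue([]byte("msg")).Level(0, 0, 1)}, // column 1
        )
        fmt.Println(row)
    }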
Ascending constructs a SortingColumn value which dictates to sort the column at the path given as argument in ascending order.
AsyncPages wraps the given Pages instance to perform page reads asynchronously in a separate goroutine. Performing page reads asynchronously is important when the application may be reading pages from a high latency backend, and the last page read may be processed while initiating reading of the next page.
BloomFilters creates a configuration option which defines the bloom filters that parquet writers should generate. The compute and memory footprint of generating bloom filters for all columns of a parquet schema can be significant, so by default no filters are created and applications need to explicitly declare the columns that they want to create filters for.
BooleanValue constructs a BOOLEAN parquet value from the bool passed as argument.
BSON constructs a leaf node of BSON logical type. https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#bson
ByteArrayValue constructs a BYTE_ARRAY parquet value from the byte slice passed as argument.
ColumnBufferCapacity creates a configuration option which defines the size of row group column buffers. Defaults to 16384.
ColumnIndexSizeLimit creates a configuration option to customize the size limit of page boundaries recorded in column indexes. Defaults to 16.
ColumnPageBuffers creates a configuration option to customize the buffer pool used when constructing row groups. This can be used to provide on-disk buffers as swap space to ensure that the parquet file creation will not be bottlenecked on the amount of memory available. Defaults to using in-memory buffers.
CompareDescending constructs a comparison function which inverses the order of values.
CompareNullsFirst constructs a comparison function which assumes that null values are smaller than all other values.
CompareNullsLast constructs a comparison function which assumes that null values are greater than all other values.
Compressed wraps the node passed as argument to use the given compression codec. If the codec is nil, the node's compression is left unchanged. The function panics if it is called on a non-leaf node.
Compression creates a configuration option which sets the default compression codec used by a writer for columns where none were defined.
Convert constructs a conversion function from one parquet schema to another. The function supports converting between schemas where the source or target have extra columns; if there are more columns in the source, they will be stripped out of the rows. Extra columns in the target schema will be set to null or zero values. The returned function is intended to be used to append the converted source row to the destination buffer.
ConvertRowGroup constructs a wrapper of the given row group which applies the given schema conversion to its rows.
ConvertRowReader constructs a wrapper of the given row reader which applies the given schema conversion to the rows.
CopyPages copies pages from src to dst, returning the number of values that were copied. The function returns any error it encounters reading or writing pages, except for io.EOF from the reader which indicates that there were no more pages to read.
CopyRows copies rows from src to dst. The underlying types of src and dst are tested to determine if they expose information about the schema of rows that are read and expected to be written. If the schema information are available but do not match, the function will attempt to automatically convert the rows from the source schema to the destination. As an optimization, the src argument may implement RowWriterTo to bypass the default row copy logic and provide its own. The dst argument may also implement RowReaderFrom for the same purpose. The function returns the number of rows written, or any error encountered other than io.EOF.
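A sketch of the conversion-aware copy described above, moving rows from a buffer into a writer with a narrower schema so the extra column is stripped during the copy. The Wide and Narrow types and the output writer are illustrative, and the example assumes GenericBuffer's Write accepts a slice of T:

    // Copy rows between two compatible schemas; CopyRows converts as needed.
    type Wide struct {
        Name  string `parquet:"name"`
        Extra string `parquet:"extra"`
    }
    type Narrow struct {
        Name string `parquet:"name"`
    }

    src := parquet.NewGenericBuffer[Wide]()
    if _, err := src.Write([]Wide{{Name: "a", Extra: "x"}, {Name: "b", Extra: "y"}}); err != nil {
        log.Fatal(err)
    }

    dst := parquet.NewGenericWriter[Narrow](output)
    if _, err := parquet.CopyRows(dst, src.Rows()); err != nil {
        log.Fatal(err)
    }
    if err := dst.Close(); err != nil {
        log.Fatal(err)
    }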
CopyValues copies values from src to dst, returning the number of values that were written. As an optimization, the reader and writer may choose to implement ValueReaderFrom and ValueWriterTo to provide their own copy logic. The function returns any error it encounters reading or writing pages, except for io.EOF from the reader which indicates that there were no more values to read.
CreatedBy creates a configuration option which sets the name of the application that created a parquet file. The option formats the "CreatedBy" file metadata according to the convention described by the parquet spec: "<application> version <version> (build <build>)" By default, the option is set to the parquet-go module name, version, and build hash.
DataPageStatistics creates a configuration option which defines whether data page statistics are emitted. This option is useful when generating parquet files that intend to be backward compatible with older readers which may not have the ability to load page statistics from the column index. Defaults to false.
DataPageVersion creates a configuration option which configures the version of data pages used when creating a parquet file. Defaults to version 2.
Date constructs a leaf node of DATE logical type. https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#date
Decimal constructs a leaf node of decimal logical type with the given scale, precision, and underlying type. https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal
DedupeRowReader constructs a row reader which drops duplicated consecutive rows, according to the comparator function passed as argument. If the underlying reader produces a sequence of rows sorted by the same comparison predicate, the output is guaranteed to produce unique rows only.
DedupeRowWriter constructs a row writer which drops duplicated consecutive rows, according to the comparator function passed as argument. If the writer is given a sequence of rows sorted by the same comparison predicate, the output is guaranteed to contain unique rows only.
DeepEqual returns true if v1 and v2 are equal, including their repetition levels, definition levels, and column indexes. See Equal for details about how value equality is determined.
DefaultFileConfig returns a new FileConfig value initialized with the default file configuration.
DefaultReaderConfig returns a new ReaderConfig value initialized with the default reader configuration.
DefaultRowGroupConfig returns a new RowGroupConfig value initialized with the default row group configuration.
DefaultSortingConfig returns a new SortingConfig value initialized with the default sorting configuration.
DefaultWriterConfig returns a new WriterConfig value initialized with the default writer configuration.
Descending constructs a SortingColumn value which dictates to sort the column at the path given as argument in descending order.
DoubleValue constructs a DOUBLE parquet value from the float64 passed as argument.
DropDuplicatedRows configures whether a sorting writer will keep or remove duplicated rows. Two rows are considered duplicates if the values of all their sorting columns are equal. Defaults to false.
Encoded wraps the node passed as argument to use the given encoding. The function panics if it is called on a non-leaf node, or if the encoding does not support the node type.
Enum constructs a leaf node with a logical type representing enumerations. https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#enum
Equal returns true if v1 and v2 are equal. Values are considered equal if they are of the same physical type and hold the same Go values. For BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY, the content of the underlying byte arrays are tested for equality. Note that the repetition levels, definition levels, and column indexes are not compared by this function, use DeepEqual instead.
FieldID wraps a node to assign it the given field id.
FileReadMode is a file configuration option which controls the way pages are read. Currently the only two options are ReadModeAsync and ReadModeSync which control whether or not pages are loaded asynchronously. It can be advantageous to use ReadModeAsync if your reader is backed by network storage. Defaults to ReadModeSync.
FileSchema is used to pass a known schema in while opening a Parquet file. This optimization is only useful if your application is currently opening an extremely large number of parquet files with the same, known schema. Defaults to nil.
FilterRowReader constructs a RowReader which exposes rows from reader for which the predicate has returned true.
FilterRowWriter constructs a RowWriter which writes rows to writer for which the predicate has returned true.
Find uses the ColumnIndex passed as argument to locate the page in a column chunk that the given value is expected to be found in. The function returns the index of the first page that might contain the value. If the function determines that the value does not exist in the index, NumPages is returned. To search an entire parquet file, iterate over its RowGroups and search each one individually; the file contains multiple row groups if writer.Flush was called before closing the file, otherwise Flush is called once on Close. The comparison function passed as last argument is used to determine the relative order of values. This should generally be the Compare method of the column type, but can sometimes be customized to modify how null values are interpreted, for example:

    pageIndex := parquet.Find(columnIndex, value,
        parquet.CompareNullsFirst(typ.Compare),
    )
FixedLenByteArrayType constructs a type for fixed-length values of the given size (in bytes).
FixedLenByteArrayValue constructs a FIXED_LEN_BYTE_ARRAY parquet value from the byte slice passed as argument.
FloatValue constructs a FLOAT parquet value from the float32 passed as argument.
Int constructs a leaf node of signed integer logical type of the given bit width. The bit width must be one of 8, 16, 32, 64, or the function will panic.
Int32Value constructs an INT32 parquet value from the int32 passed as argument.
Int64Value constructs an INT64 parquet value from the int64 passed as argument.
Int96Value constructs an INT96 parquet value from the deprecated.Int96 passed as argument.
JSON constructs a leaf node of JSON logical type. https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#json
KeyValueMetadata creates a configuration option which adds a key/value pair to the metadata of parquet files. This option is additive: it may be used multiple times to add more than one key/value pair. Keys are assumed to be unique; if the same key is repeated multiple times, the last value is retained. While the parquet format does not require unique keys, this design decision was made to optimize for the most common use case where applications leverage this extension mechanism to associate single values to keys. This may create incompatibilities with other parquet libraries, or may cause some key/value pairs to be lost when opening parquet files written with repeated keys. We can revisit this decision if it ever becomes a blocker.
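A minimal sketch of how this option is typically combined with a writer; the keys, values, and io.Discard destination are illustrative only:

    writer := parquet.NewWriter(io.Discard, // any io.Writer works here
        parquet.KeyValueMetadata("source", "events-service"), // example key/value pair
        parquet.KeyValueMetadata("schema_version", "3"),      // the option is additive
    )
    // ... write rows, then close the writer.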
Leaf returns a leaf node of the given type.
List constructs a node of LIST logical type. https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists
LookupCompressionCodec returns the compression codec associated with the given code. The function never returns nil. If the encoding is not supported, an "unsupported" codec is returned.
LookupEncoding returns the parquet encoding associated with the given code. The function never returns nil. If the encoding is not supported, encoding.NotSupported is returned.
MakeRow constructs a Row from a list of column values. The function panics if the column indexes of values in each column do not match their position in the argument list.
Map constructs a node of MAP logical type. https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps
MaxRowsPerRowGroup configures the maximum number of rows that a writer will produce in each row group. This limit is useful to control the size of row groups in both number of rows and byte size. While controlling the byte size of a row group is difficult to achieve with parquet due to column encoding and compression, the number of rows remains a useful proxy. Defaults to unlimited.
MergeRowGroups constructs a row group which is a merged view of rowGroups. If rowGroups are sorted and the passed options include sorting, the merged row group will also be sorted. The function validates the input to ensure that the merge operation is possible, ensuring that the schemas match or can be converted to an optionally configured target schema passed as argument in the option list. The sorting columns of each row group are also consulted to determine whether the output can be represented. If sorting columns are configured on the merge they must be a prefix of sorting columns of all row groups being merged.
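A hedged sketch of merging the row groups of two already-open parquet files into a single sorted output; a and b (parquet.File values), output (an io.Writer), and the "timestamp" column are assumptions made for the example:

    merged, err := parquet.MergeRowGroups(
        append(a.RowGroups(), b.RowGroups()...),
        parquet.SortingRowGroupConfig(
            parquet.SortingColumns(parquet.Ascending("timestamp")),
        ),
    )
    if err != nil {
        log.Fatal(err)
    }
    writer := parquet.NewWriter(output)
    if _, err := writer.WriteRowGroup(merged); err != nil {
        log.Fatal(err)
    }
    if err := writer.Close(); err != nil {
        log.Fatal(err)
    }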
MergeRowReader constructs a RowReader which creates an ordered sequence of all the readers using the given compare function as the ordering predicate.
MultiRowGroup wraps multiple row groups to appear as a single RowGroup. The row groups must all have the same schema or an error is returned.
MultiRowWriter constructs a RowWriter which dispatches writes to all the writers passed as arguments. When writing rows, if any of the writers returns an error, the operation is aborted and the error returned. If one of the writers did not error, but did not write all the rows, the operation is aborted and io.ErrShortWrite is returned. Rows are written sequentially to each writer in the order they are given to this function.
NewBuffer constructs a new buffer, using the given list of buffer options to configure the buffer returned by the function. The function panics if the buffer configuration is invalid. Programs that cannot guarantee the validity of the options passed to NewBuffer should construct the buffer configuration independently prior to calling this function:

    config, err := parquet.NewRowGroupConfig(options...)
    if err != nil {
        // handle the configuration error
        ...
    } else {
        // this call to create a buffer is guaranteed not to panic
        buffer := parquet.NewBuffer(config)
        ...
    }
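A sketch of the buffer's main use case, sorting rows in memory before writing them out; the Row type, column names, and output io.Writer are illustrative:

    type Row struct {
        ID   int64  `parquet:"id"`
        Name string `parquet:"name"`
    }

    buffer := parquet.NewBuffer(
        parquet.SortingRowGroupConfig(
            parquet.SortingColumns(parquet.Ascending("id")),
        ),
    )
    for _, row := range []Row{{ID: 2, Name: "b"}, {ID: 1, Name: "a"}} {
        if err := buffer.Write(row); err != nil {
            log.Fatal(err)
        }
    }
    sort.Sort(buffer) // Buffer implements sort.Interface

    writer := parquet.NewWriter(output)
    if _, err := writer.WriteRowGroup(buffer); err != nil { // Buffer implements RowGroup
        log.Fatal(err)
    }
    if err := writer.Close(); err != nil {
        log.Fatal(err)
    }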
NewBufferPool creates a new in-memory page buffer pool. The implementation is backed by sync.Pool and allocates memory buffers on the Go heap.
NewChunkBufferPool creates a new in-memory page buffer pool. The implementation is backed by sync.Pool and allocates memory buffers on the Go heap in fixed-size chunks.
NewColumnIndex constructs a ColumnIndex instance from the given parquet format column index. The kind argument configures the type of values stored in the index.
NewFileBufferPool creates a new on-disk page buffer pool.
NewFileConfig constructs a new file configuration applying the options passed as arguments. The function returns a non-nil error if some of the options carried invalid configuration values.
Type Parameters: T: any NewGenericBuffer is like NewBuffer but returns a GenericBuffer[T] suited to write rows of Go type T. The type parameter T should be a map, struct, or any. Any other types will cause a panic at runtime. Type checking is a lot more effective when the generic parameter is a struct type; using map and interface types is somewhat similar to using a Buffer. If an interface type is used for the type parameter, providing a schema at instantiation is required. If the option list explicitly declares a schema, it must be compatible with the schema generated from T.
Type Parameters: T: any NewGenericReader is like NewReader but returns a GenericReader[T] suited to read rows of Go type T. The type parameter T should be a map, struct, or any. Any other types will cause a panic at runtime. Type checking is a lot more effective when the generic parameter is a struct type; using map and interface types is somewhat similar to using a Reader. If the option list explicitly declares a schema, it must be compatible with the schema generated from T.
Type Parameters: T: any
Type Parameters: T: any NewGenericWriter is like NewWriter but returns a GenericWriter[T] suited to write rows of Go type T. The type parameter T should be a map, struct, or any. Any other types will cause a panic at runtime. Type checking is a lot more effective when the generic parameter is a struct type; using map and interface types is somewhat similar to using a Writer. If the option list explicitly declares a schema, it must be compatible with the schema generated from T. Sorting columns may be set on the writer to configure the generated row groups metadata. However, rows are always written in the order they were seen; no reordering is performed, and the writer expects the application to ensure proper correlation between the order of rows and the list of sorting columns. See SortingWriter[T] for a writer which handles reordering rows based on the configured sorting columns.
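A short sketch of the generic writer with a struct type parameter; the Log type and output file name are assumptions made for the example:

    type Log struct {
        Level   string `parquet:"level,dict"`
        Message string `parquet:"message,zstd"`
    }

    f, err := os.Create("logs.parquet") // illustrative path
    if err != nil {
        log.Fatal(err)
    }
    writer := parquet.NewGenericWriter[Log](f)
    if _, err := writer.Write([]Log{{Level: "info", Message: "service started"}}); err != nil {
        log.Fatal(err)
    }
    if err := writer.Close(); err != nil {
        log.Fatal(err)
    }
    if err := f.Close(); err != nil {
        log.Fatal(err)
    }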
NewReader constructs a parquet reader reading rows from the given io.ReaderAt. In order to read parquet rows, the io.ReaderAt must be converted to a parquet.File. If r is already a parquet.File it is used directly; otherwise, the io.ReaderAt value is expected to either have a `Size() int64` method or implement io.Seeker in order to determine its size. The function panics if the reader configuration is invalid. Programs that cannot guarantee the validity of the options passed to NewReader should construct the reader configuration independently prior to calling this function:

    config, err := parquet.NewReaderConfig(options...)
    if err != nil {
        // handle the configuration error
        ...
    } else {
        // this call to create a reader is guaranteed not to panic
        reader := parquet.NewReader(input, config)
        ...
    }
NewReaderConfig constructs a new reader configuration applying the options passed as arguments. The function returns a non-nil error if some of the options carried invalid configuration values.
Type Parameters: T: any NewRowBuffer constructs a new row buffer.
NewRowBuilder constructs a RowBuilder which builds rows for the parquet schema passed as argument.
NewRowGroupConfig constructs a new row group configuration applying the options passed as arguments. The function returns a non-nil error if some of the options carried invalid configuration values.
NewRowGroupReader constructs a new Reader which reads rows from the RowGroup passed as argument.
NewSchema constructs a new Schema object with the given name and root node. The function panics if Node contains more leaf columns than supported by the package (see parquet.MaxColumnIndex).
NewSortingConfig constructs a new sorting configuration applying the options passed as arguments. The function returns a non-nil error if some of the options carried invalid configuration values.
Type Parameters: T: any NewSortingWriter constructs a new sorting writer which writes a parquet file where rows of each row group are ordered according to the sorting columns configured on the writer. The sortRowCount argument defines the target number of rows that will be sorted in memory before being written to temporary row groups. The greater this value, the more memory is needed to buffer rows. Choosing a value that is too small limits the maximum number of rows that can exist in the output file, since the writer cannot create more than 32K temporary row groups to hold the sorted row chunks.
NewWriter constructs a parquet writer writing a file to the given io.Writer. The function panics if the writer configuration is invalid. Programs that cannot guarantee the validity of the options passed to NewWriter should construct the writer configuration independently prior to calling this function:

    config, err := parquet.NewWriterConfig(options...)
    if err != nil {
        // handle the configuration error
        ...
    } else {
        // this call to create a writer is guaranteed not to panic
        writer := parquet.NewWriter(output, config)
        ...
    }
NewWriterConfig constructs a new writer configuration applying the options passed as arguments. The function returns a non-nil error if some of the options carried invalid configuration values.
NullsFirst wraps the SortingColumn passed as argument so that it instructs the row group to place null values first in the column.
NullValue constructs a null value, which is the zero-value of the Value type.
OpenFile opens a parquet file and reads the content between offset 0 and the given size in r. Only the parquet magic bytes and footer are read, column chunks and other parts of the file are left untouched; this means that successfully opening a file does not validate that the pages have valid checksums.
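A sketch of opening a file from an *os.File and combining it with some of the file options documented in this package; the path and option values are illustrative:

    f, err := os.Open("data.parquet") // illustrative path
    if err != nil {
        log.Fatal(err)
    }
    stat, err := f.Stat()
    if err != nil {
        log.Fatal(err)
    }
    pf, err := parquet.OpenFile(f, stat.Size(),
        parquet.SkipBloomFilters(true), // don't eagerly load bloom filters
        parquet.ReadBufferSize(4<<20),  // larger reads for network-backed storage
    )
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(pf.Schema())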
Optional wraps the given node to make it optional.
PageBufferSize configures the size of column page buffers on parquet writers. Note that the page buffer size refers to the in-memory buffers where pages are generated, not the size of pages after encoding and compression. This design choice was made to help control the amount of memory needed to read and write pages rather than controlling the space used by the encoded representation on disk. Defaults to 256KiB.
func PrintPage(w io.Writer, page Page) error
func PrintRowGroup(w io.Writer, rowGroup RowGroup) error
func PrintSchema(w io.Writer, name string, node Node) error
func PrintSchemaIndent(w io.Writer, name string, node Node, pattern, newline string) error
Type Parameters: T: any Read reads and returns rows from the parquet file in the given reader. The type T defines the type of rows read from r. T must be compatible with the file's schema or an error will be returned. The row type might represent a subset of the full schema, in which case only a subset of the columns will be loaded from r. This function is provided for convenience to facilitate reading of parquet files from arbitrary locations in cases where the data set fits in memory.
ReadBufferSize is a file configuration option which controls the default buffer sizes for reads made to the provided io.Reader. The default of 4096 is appropriate for disk-based access, but if your reader is backed by network storage it can be advantageous to increase this value to something closer to 4 MiB. Defaults to 4096.
Type Parameters: T: any ReadFile reads rows of the parquet file at the given path. The type T defines the type of rows read from the file. T must be compatible with the file's schema or an error will be returned. The row type might represent a subset of the full schema, in which case only a subset of the columns will be loaded from the file. This function is provided for convenience to facilitate reading of parquet files from the file system in cases where the data set fits in memory.
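A minimal sketch of the convenience reader; the Contact type and file path are illustrative:

    type Contact struct {
        Name        string `parquet:"name"`
        PhoneNumber string `parquet:"phoneNumber,optional"`
    }

    contacts, err := parquet.ReadFile[Contact]("contacts.parquet")
    if err != nil {
        log.Fatal(err)
    }
    for _, c := range contacts {
        fmt.Println(c.Name)
    }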
Release is a helper function to decrement the reference counter of pages backed by memory which can be granularly managed by the application. Usage of this function is optional and, together with Retain, is intended to allow finer-grained memory management in the application, at the expense of potentially causing panics if the page is used after its reference count has reached zero. Most programs should be able to rely on automated memory management provided by the Go garbage collector instead. The function should be called to return a page to the internal buffer pool, when a goroutine "releases ownership" it acquired either by being the single owner (e.g. capturing the return value from a ReadPage call) or having gotten shared ownership by calling Retain. Calling this function on pages that do not embed a reference counter does nothing.
Repeated wraps the given node to make it repeated.
Required wraps the given node to make it required.
Retain is a helper function to increment the reference counter of pages backed by memory which can be granularly managed by the application. Usage of this function is optional and, together with Release, is intended to allow finer-grained memory management in the application. Most programs should be able to rely on automated memory management provided by the Go garbage collector instead. The function should be called when a page's lifetime is about to be shared between multiple goroutines or layers of an application, and the program wants to express "sharing ownership" of the page. Calling this function on pages that do not embed a reference counter does nothing.
ScanRowReader constructs a RowReader which exposes rows from reader until the predicate returns false for one of the rows, or EOF is reached.
SchemaOf constructs a parquet schema from a Go value. The function can construct parquet schemas from struct or pointer-to-struct values only. A panic is raised if a Go value of a different type is passed to this function. When creating a parquet Schema from a Go value, the struct fields may contain a "parquet" tag to describe properties of the parquet node. The "parquet" tag follows the conventional format of Go struct tags: a comma-separated list of values describe the options, with the first one defining the name of the parquet column. The following options are also supported in the "parquet" struct tag:

    optional  | make the parquet column optional
    snappy    | sets the parquet column compression codec to snappy
    gzip      | sets the parquet column compression codec to gzip
    brotli    | sets the parquet column compression codec to brotli
    lz4       | sets the parquet column compression codec to lz4
    zstd      | sets the parquet column compression codec to zstd
    plain     | enables the plain encoding (no-op default)
    dict      | enables dictionary encoding on the parquet column
    delta     | enables delta encoding on the parquet column
    list      | for slice types, use the parquet LIST logical type
    enum      | for string types, use the parquet ENUM logical type
    uuid      | for string and [16]byte types, use the parquet UUID logical type
    decimal   | for int32, int64 and [n]byte types, use the parquet DECIMAL logical type
    date      | for int32 types, use the DATE logical type
    timestamp | for int64 types, use the TIMESTAMP logical type with millisecond precision by default
    split     | for float32/float64, use the BYTE_STREAM_SPLIT encoding
    id(n)     | where n is an int denoting the column field id, e.g. id(2) for a column with field id 2

The date logical type is an int32 value of the number of days since the unix epoch.

The timestamp precision can be changed by passing the precision to use as an argument. Supported precisions are nanosecond, millisecond and microsecond. Example:

    type Message struct {
        TimestampMicros int64 `parquet:"timestamp_micros,timestamp(microsecond)"`
    }

The decimal tag must be followed by two integer parameters, the first integer representing the scale and the second the precision; for example:

    type Item struct {
        Cost int64 `parquet:"cost,decimal(0:3)"`
    }

Invalid combinations of struct tags and Go types, or repeated options, will cause the function to panic. As a special case, if the field tag is "-", the field is omitted from the schema and the data will not be written into the parquet file(s). Note that a field with name "-" can still be generated using the tag "-,".

The configuration of parquet maps is done via two tags:

    - The `parquet-key` tag configures the key of a map.
    - The `parquet-value` tag configures a map's values, for example to declare their native parquet types.

When configuring a parquet map, the `parquet` tag configures the map itself. For example, the following sets the int64 key of the map to be a timestamp:

    type Actions struct {
        Action map[int64]string `parquet:"," parquet-key:",timestamp"`
    }

The schema name is the Go type name of the value.
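A short sketch of deriving and printing a schema from a Go struct; the Item type is illustrative:

    type Item struct {
        Cost int64  `parquet:"cost,decimal(0:3)"`
        Name string `parquet:"name,optional"`
    }

    schema := parquet.SchemaOf(new(Item))
    fmt.Println(schema) // prints the textual parquet schema, named after the Go type "Item"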
SkipBloomFilters is a file configuration option which prevents automatically reading the bloom filters when opening a parquet file, when set to true. This is useful as an optimization when programs know that they will not need to consume the bloom filters. Defaults to false.
SkipPageBounds lists the path to a column that shouldn't have bounds written to the footer of the parquet file. This is useful for data blobs, like a raw html file, where the bounds are not meaningful. This option is additive, it may be used multiple times to skip multiple columns.
SkipPageIndex is a file configuration option which prevents automatically reading the page index when opening a parquet file, when set to true. This is useful as an optimization when programs know that they will not need to consume the page index. Defaults to false.
SortingBuffers creates a configuration option which sets the pool of buffers used to hold intermediary state when sorting parquet rows. Defaults to using in-memory buffers.
SortingColumns creates a configuration option which defines the sorting order of columns in a row group. The order of sorting columns passed as argument defines the ordering hierarchy; when elements are equal in the first column, the second column is used to order rows, etc...
SortingRowGroupConfig is a row group option which applies configuration specific to sorting row groups.
SortingWriterConfig is a writer option which applies configuration specific to sorting writers.
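A sketch combining the sorting options above to declare sorting columns in a writer's row group metadata; the column paths and io.Discard output are illustrative:

    writer := parquet.NewWriter(io.Discard,
        parquet.SortingWriterConfig(
            parquet.SortingColumns(
                parquet.Ascending("tenant"),
                parquet.Descending("timestamp"),
                parquet.NullsFirst(parquet.Ascending("region")),
            ),
        ),
    )
    // ... write rows, then close the writer.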
SplitBlockFilter constructs a split block bloom filter object for the column at the given path, with the given bitsPerValue. If you are unsure what number of bitsPerValue to use, 10 is a reasonable tradeoff between size and error rate for common datasets. For more information on the tradeoff between size and error rate, consult this website: https://hur.st/bloomfilter/?n=4000&p=0.1&m=&k=1
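For example, a bloom filter on a "name" column with the suggested 10 bits per value could be configured as follows (the column path and io.Discard output are illustrative):

    writer := parquet.NewWriter(io.Discard,
        parquet.BloomFilters(
            parquet.SplitBlockFilter(10, "name"),
        ),
    )
    // ... write rows, then close the writer.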
String constructs a leaf node of UTF8 logical type. https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#string
Time constructs a leaf node of TIME logical type. https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#time
Timestamp constructs a leaf node of TIMESTAMP logical type. https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#timestamp
TransformRowReader constructs a RowReader which applies the given transform to each row read from reader. The transformation function appends the transformed src row to dst, returning dst and any error that occurred during the transformation. If dst is returned unchanged, the row is skipped.
TransformRowWriter constructs a RowWriter which applies the given transform to each row written to writer. The transformation function appends the transformed src row to dst, returning dst and any error that occurred during the transformation. If dst is returned unchanged, the row is skipped.
Uint constructs a leaf node of unsigned integer logical type of the given bit width. The bit width must be one of 8, 16, 32, 64, or the function will panic.
UUID constructs a leaf node of UUID logical type. https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#uuid
ValueOf constructs a parquet value from a Go value v. The physical type of the value is assumed from the Go type of v using the following conversion table:

    Go type | Parquet physical type
    ------- | ---------------------
    nil     | NULL
    bool    | BOOLEAN
    int8    | INT32
    int16   | INT32
    int32   | INT32
    int64   | INT64
    int     | INT64
    uint8   | INT32
    uint16  | INT32
    uint32  | INT32
    uint64  | INT64
    uintptr | INT64
    float32 | FLOAT
    float64 | DOUBLE
    string  | BYTE_ARRAY
    []byte  | BYTE_ARRAY
    [*]byte | FIXED_LEN_BYTE_ARRAY

When converting a []byte or [*]byte value, the underlying byte array is not copied; instead, the returned parquet value holds a reference to it. The repetition and definition levels of the returned value are both zero. The function panics if the Go value cannot be represented in parquet.
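A small sketch of the conversion table in action; the printed names assume the Kind type's textual representation:

    fmt.Println(parquet.ValueOf("hello").Kind())       // BYTE_ARRAY
    fmt.Println(parquet.ValueOf(42).Kind())            // INT64 (Go int maps to INT64)
    fmt.Println(parquet.ValueOf(float32(1.5)).Kind())  // FLOAT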
Type Parameters: T: any Write writes the given list of rows to a parquet file written to w. This function is provided for convenience to facilitate the creation of parquet files.
WriteBufferSize configures the size of the write buffer. Setting the writer buffer size to zero deactivates buffering, all writes are immediately sent to the output io.Writer. Defaults to 32KiB.
Type Parameters: T: any WriteFile writes the given list of rows to a parquet file created at the given path. This function is provided for convenience to facilitate writing parquet files to the file system.
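A sketch of both convenience helpers; the Event type, in-memory buffer, and file path are illustrative:

    type Event struct {
        ID   int64  `parquet:"id"`
        Kind string `parquet:"kind"`
    }

    rows := []Event{{ID: 1, Kind: "create"}, {ID: 2, Kind: "delete"}}

    // Write to an arbitrary io.Writer (here an in-memory buffer).
    buf := new(bytes.Buffer)
    if err := parquet.Write(buf, rows); err != nil {
        log.Fatal(err)
    }

    // Or write directly to a file on disk.
    if err := parquet.WriteFile("events.parquet", rows); err != nil {
        log.Fatal(err)
    }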
ZeroValue constructs a zero value of the given kind.
Package-Level Variables (total 39)
BitPacked is the deprecated bit-packed encoding for repetition and definition levels.
Brotli is the BROTLI parquet compression codec.
ByteStreamSplit is an encoding for floating-point data.
DeltaBinaryPacked is the delta binary packed parquet encoding.
DeltaByteArray is the delta byte array parquet encoding.
DeltaLengthByteArray is the delta length byte array parquet encoding.
ErrCorrupted is an error returned by the Err method of ColumnPages instances when they encountered a mismatch between the CRC checksum recorded in a page header and the one computed while reading the page data.
ErrConversion is used to indicate that a conversion between two values cannot be done because there are no rules to translate between their physical types.
ErrMissingPageHeader is an error returned when a page reader encounters a malformed page header which is missing page-type-specific information.
ErrMissingRootColumn is an error returned when opening an invalid parquet file which does not have a root column.
ErrRowGroupSchemaMismatch is an error returned when attempting to write a row group but the source and destination schemas differ.
ErrRowGroupSchemaMissing is an error returned when attempting to write a row group but the source has no schema.
ErrRowGroupSortingColumnsMismatch is an error returned when attempting to write a row group but the sorting columns differ in the source and destination.
ErrSeekOutOfRange is an error returned when seeking to a row index which is less than the first row of a page.
ErrTooManyRowGroups is returned when attempting to generate a parquet file with more than MaxRowGroups row groups.
ErrUnexpectedDefinitionLevels is an error returned when attempting to decode definition levels into a page which is part of a required column.
ErrUnexpectedDictionaryPage is an error returned when a page reader encounters a dictionary page after the first page, or in a column which does not use a dictionary encoding.
ErrUnexpectedRepetitionLevels is an error returned when attempting to decode repetition levels into a page which is not part of a repeated column.
Gzip is the GZIP parquet compression codec.
Lz4Raw is the LZ4_RAW parquet compression codec.
Plain is the default parquet encoding.
PlainDictionary is the plain dictionary parquet encoding. This encoding should not be used anymore in parquet 2.0 and later, it is implemented for backwards compatibility to support reading files that were encoded with older parquet libraries.
RLE is the hybrid bit-pack/run-length parquet encoding.
RLEDictionary is the RLE dictionary parquet encoding.
Snappy is the SNAPPY parquet compression codec.
Uncompressed is a parquet compression codec representing uncompressed pages.
Zstd is the ZSTD parquet compression codec.
Package-Level Constants (total 25)
const Boolean Kind = 0
const ByteArray Kind = 6
const DefaultMaxRowsPerRowGroup = 9223372036854775807
const DefaultPageBufferSize = 262144
const DefaultSkipPageIndex = false
const DefaultWriteBufferSize = 32768
const Double Kind = 5
const Float Kind = 4
const Int32 Kind = 1
const Int64 Kind = 2
const Int96 Kind = 3
MaxColumnDepth is the maximum column depth supported by this package.
MaxColumnIndex is the maximum column index supported by this package.
MaxDefinitionLevel is the maximum definition level supported by this package.
MaxRepetitionLevel is the maximum repetition level supported by this package.
MaxRowGroups is the maximum number of row groups which can be contained in a single parquet file. This limit is enforced by the use of 16 bits signed integers in the file metadata footer of parquet files. It is part of the parquet specification and therefore cannot be changed.
const ReadModeAsync ReadMode = 1 // ReadModeAsync reads pages asynchronously in the background.
const ReadModeSync ReadMode = 0 // ReadModeSync reads pages synchronously on demand (Default).