Schema
Parquet is a format that stores data in a structured way. It has different types for different kinds of data, like numbers, strings, dates and so on. This means that you have to tell Parquet what type each column of your data is before you can write it to a file. This is called declaring a schema. Declaring a schema helps Parquet to compress and read your data more efficiently.
A schema is defined by creating an instance of the ParquetSchema class and passing it a collection of Field objects. Various helper methods on both DataSet and ParquetSchema exist to simplify schema declaration, but this page describes the explicit form.
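As a minimal sketch of the explicit form, the snippet below builds a two-column schema. The column names are hypothetical, and the Parquet.Schema namespace is an assumption that depends on the library version you have installed (older releases kept these types elsewhere).

```csharp
// Assumption: schema types live in the Parquet.Schema namespace in your version.
using Parquet.Schema;

// "id" and "name" are illustrative column names, not part of the library.
var schema = new ParquetSchema(
    new DataField<int>("id"),
    new DataField<string>("name"));
```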
There are several types of fields you can specify in your schema; the most common is DataField. DataField derives from the abstract base Field class (as do all other field types) and simply means that it declares actual data rather than an abstraction.
You can declare a DataField by specifying a column name and its type in the constructor, in one of two forms:
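The two forms might look like the following sketch. The column name "id" is hypothetical, and the availability of the DataType enumeration and the Parquet.Schema namespace are assumptions tied to the library version this page describes.

```csharp
// Assumption: schema types live in the Parquet.Schema namespace in your version.
using Parquet.Schema;

// Form 1: declarative, with the type picked from the DataType enumeration.
var declarative = new DataField("id", DataType.Int32);

// Form 2: generic shortcut, with the type given as a .NET type argument.
var generic = new DataField<int>("id");
```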
The first one is more declarative and lets you select the data type from the DataType enumeration, which lists the types currently supported.
The second one is just a shortcut to DataField that allows you to use .NET generics.
There are also specialised versions of DataField that let you specify more precise metadata about certain Parquet data types. For instance, DecimalDataField lets you specify a precision and scale other than the default values.
Non-data fields wrap complex structures such as lists (ListField), maps (MapField) and structs (StructField).
Full schema type hierarchy can be expressed as:
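A rough sketch of that hierarchy, limited to the types named on this page (the library may define further field types that are not shown here):

```
Field (abstract base)
├── DataField
│   ├── DateTimeDataField
│   └── DecimalDataField
├── ListField
├── MapField
└── StructField
```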
Lists
To declare a list, use the ListField class and specify:
The name of the list field.
What data it is holding.
ListField's second parameter is of type Field, meaning you can specify anything as its member: a primitive DataField or any other field type.
Lists of primitive types
To declare a list of primitive types, just specify a DataField as the second parameter. For instance, to declare a list of integers:
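A minimal sketch of such a declaration, assuming the two-argument ListField constructor described above (list name, then member field). The names "item" and "id" are hypothetical.

```csharp
// Assumption: schema types live in the Parquet.Schema namespace in your version.
using Parquet.Schema;

// A list called "item" whose members are non-nullable integers.
var listOfInts = new ListField("item",
    new DataField<int>("id"));
```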
Lists of structs
A list can contain any field type. To demonstrate, this is how you would declare a list of structs:
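One way this could be written is sketched below. It assumes StructField takes a name followed by its member fields; all field names are hypothetical.

```csharp
// Assumption: schema types live in the Parquet.Schema namespace in your version.
using Parquet.Schema;

// A list called "items" whose members are structs with two data fields each.
var listOfStructs = new ListField("items",
    new StructField("item",
        new DataField<int>("id"),
        new DataField<string>("name")));
```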
Special cases
Null Values
Declaring a schema as above allows you to add elements of type int, but null values are not allowed (you will get an exception when trying to add a null value to the DataSet). To allow nulls, you need to declare them in the schema explicitly by specifying a nullable type:
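As a sketch, a nullable integer column can be declared by using the nullable .NET type as the generic argument. The column name "id" is hypothetical.

```csharp
// Assumption: schema types live in the Parquet.Schema namespace in your version.
using Parquet.Schema;

// int? (Nullable<int>) marks the column as accepting null values.
var nullableId = new DataField<int?>("id");
```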
This forces the schema to be nullable, so you can add null values. A nullable column is often useful even if you have no nulls at the moment, for instance when you will later append data containing nulls to the file.
Nullable columns incur a slight performance and data size overhead, as parquet needs to store an additional nullable flag for each value.
Dates
Early versions of the Parquet format did not support dates, so dates were commonly stored as int96 numbers. For backward compatibility, this remains the default date storage format.
If you need to override the date storage format, use DateTimeDataField instead of DataField<DateTime>. It lets you specify the precision; for example, the following lowers the precision to write only the date part, without the time.
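A minimal sketch of a date-only column, assuming DateTimeDataField takes the column name followed by a DateTimeFormat value and that the enumeration has a Date member. The column name "created" is hypothetical.

```csharp
// Assumption: schema types live in the Parquet.Schema namespace in your version.
using Parquet.Schema;

// Stores only the date part of each value; the time of day is not written.
var dateOnly = new DateTimeDataField("created", DateTimeFormat.Date);
```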
See the DateTimeFormat enumeration for a detailed explanation of the available options.
Decimals
Writing a decimal uses precision 38 and scale 18 by default; however, you can set a different precision and scale by using the DecimalDataField schema element (see its constructor for options).
Note that AWS Athena, Impala and possibly other systems do not conform to the Parquet specification when reading decimal fields. If this affects you, you must use DecimalDataField explicitly and set forceByteArrayEncoding to true.
For instance:
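The sketch below shows both cases. It assumes the constructor takes the column name, precision, scale and then the forceByteArrayEncoding flag, in that order. The column names are hypothetical.

```csharp
// Assumption: schema types live in the Parquet.Schema namespace in your version.
using Parquet.Schema;

// Precision 4 and scale 1, with the default encoding.
var small = new DecimalDataField("price", 4, 1);

// Precision 10 and scale 2, forcing byte array encoding for readers such as Impala.
var compatible = new DecimalDataField("amount", 10, 2, forceByteArrayEncoding: true);
```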
Since v4.2.3 variable-size decimal encoding (variant 4) is supported by the reader.