Parquet.Net Help

Class serialisation

Parquet.Net is generally extremely flexible in terms of supporting the internals of the Apache Parquet format and allows you to do whatever the low-level API allows. However, in many cases writing boilerplate code is not desirable when you are working with business objects and just want to serialise them into a parquet file.

Class serialisation is really fast as internally it generates compiled expression trees on the fly. That means there is a tiny bit of delay when serialising the first entity, which in most cases is negligible. Once the class has been serialised at least once, further operations become amazingly fast (around a x40 speed improvement compared to reflection on relatively large amounts of data, ~5 million records).

Quick start

Both the serialiser and deserialiser work with collections of classes. Let's say you have the following class definition:

class Record {
    public DateTime Timestamp { get; set; }
    public string EventName { get; set; }
    public double MeterValue { get; set; }
}

Let's generate a few instances of those for a test:

var data = Enumerable.Range(0, 1_000_000).Select(i => new Record {
    Timestamp = DateTime.UtcNow.AddSeconds(i),
    EventName = i % 2 == 0 ? "on" : "off",
    MeterValue = i
}).ToList();

Here is what you can do to write out those classes in a single file:

await ParquetSerializer.SerializeAsync(data, "/mnt/storage/data.parquet");

That's it! Of course, the .SerializeAsync() method also has overloads and optional parameters allowing you to control the serialisation process, such as selecting the compression method, row group size, etc.
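For instance, a minimal sketch (CompressionMethod and RowGroupSize are option names on ParquetSerializerOptions; treat the exact shape as an assumption and check the current API):

await ParquetSerializer.SerializeAsync(data, "/mnt/storage/data.parquet",
    new ParquetSerializerOptions {
        CompressionMethod = CompressionMethod.Gzip, // assumed option: pick the compression codec
        RowGroupSize = 100_000                      // see "Specifying row group size" below
    });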

Parquet.Net will automatically figure out the file schema by reflecting on the class structure, types, nullability and other parameters for you.

In order to deserialize this file back into a collection of classes, you would write the following:

IList<Record> data = await ParquetSerializer.DeserializeAsync<Record>("/mnt/storage/data.parquet");

Deserialize records by RowGroup

If you have a large file, and you want to deserialize it in chunks, you can also read records by row group. This can help to keep memory usage low as you won't need to load the entire file into memory.

IList<Record> data = await ParquetSerializer.DeserializeAsync<Record>("/mnt/storage/data.parquet", rowGroupIndex);
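For example, here is a minimal sketch of chunked reading, assuming the file from the quick start; ParquetReader (from the low-level API) is used only to discover how many row groups the file has:

using System.IO;
using Parquet;
using Parquet.Serialization;

using Stream fs = File.OpenRead("/mnt/storage/data.parquet");
using ParquetReader reader = await ParquetReader.CreateAsync(fs);

for(int rowGroupIndex = 0; rowGroupIndex < reader.RowGroupCount; rowGroupIndex++) {
    // only one row group's worth of records is materialised at a time
    IList<Record> chunk = await ParquetSerializer.DeserializeAsync<Record>(
        "/mnt/storage/data.parquet", rowGroupIndex);
    // process chunk here
}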

Class member requirements

Parquet.Net can serialize and deserialize into class properties and class fields (class fields support was introduced in v4.23.0).

Naturally, in order to serialize a class, a property must be readable, and in order to deserialize a class, a property must be writable. This is a standard requirement for any serialisation library.

Fields are by their nature both readable and writable (unless marked readonly), so you don't need to do anything special to make them work.
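For example, a minimal sketch (hypothetical class) mixing a field and a property:

class Metric {
    public string? Name;                // public field: supported since v4.23.0
    public double Value { get; set; }   // property: getter used for serialization, setter for deserialization
}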

Customising serialisation

Serialisation tries to fit into the C# ecosystem like a ninja 🥷, including customisations. It supports the following attributes from the System.Text.Json.Serialization namespace:

  • JsonPropertyName - changes mapping of column name to property name.

  • JsonIgnore - ignores property when reading or writing.

  • JsonPropertyOrder - allows you to reorder columns when writing to a file (by default they are written in class definition order). Only root-level properties and struct (class) member properties can be ordered (ordering anything else wouldn't make sense); see the example after this list.
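For illustration, here's a minimal sketch combining all three attributes (the class and property names are hypothetical):

using System.Text.Json.Serialization;

class LogEntry {
    [JsonPropertyOrder(0)]
    public DateTime Timestamp { get; set; }

    [JsonPropertyName("event_name")]    // column is named "event_name" instead of "EventName"
    [JsonPropertyOrder(1)]
    public string? EventName { get; set; }

    [JsonIgnore]                        // never written to or read from the file
    public string? DebugInfo { get; set; }
}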

Where built-in JSON attributes are not sufficient, extra attributes are added.

Strings

In .NET, the string class is a reference type, which means it can be null. However, the Parquet specification allows types to be declared required or optional.

To fit into .NET ecosystem as closely as possible, this library will serialize .NET strings as optional by default. If you want to change this behaviour, you can use [ParquetRequired] attribute:

public string OptionalString { get; set; }

[ParquetRequired]
public string RequiredString { get; set; }

In this example, OptionalString will be serialized as optional, while RequiredString will be serialized as required.

Dates

By default, dates (DateTime) are serialized as INT96 numbers, which include nanosecond precision within the day. In general, INT96 is obsolete in Parquet; however, older systems such as Impala and Hive still actively use it to represent dates.

Therefore, when this library sees INT96 type, it will automatically treat it as a date for both serialization and deserialization.

If you would rather use a normal, non-legacy date type, just annotate a property with [ParquetTimestamp]:

[ParquetTimestamp]
public DateTime TimestampDate { get; set; }

which by default serialises dates with millisecond precision. If you need to increase the precision, you can use the [ParquetTimestamp] attribute with an appropriate resolution:

[ParquetTimestamp(ParquetTimestampResolution.Microseconds)]
public DateTime TimestampDate { get; set; }

Times

By default, time (TimeSpan) is serialised with millisecond precision, but you can increase it by adding the [ParquetMicroSecondsTime] attribute:

[ParquetMicroSecondsTime]
public TimeSpan MicroTime { get; set; }

Decimal Numbers

By default, decimal is serialized with a precision (number of digits in a number) of 38 and a scale (number of digits to the right of the decimal point) of 18. If you need a different precision/scale pair, use the [ParquetDecimal] attribute:

[ParquetDecimal(40, 20)]
public decimal With_40_20 { get; set; }

Legacy Repeatable (Legacy Arrays)

One of the features of Parquet files is that they can contain simple repeatable fields, also known as arrays, that store multiple values in a single column. However, this feature is not widely supported by systems that process Parquet files, and it may cause errors or compatibility issues. An example of such a file can be found in the test data folder, called legacy_primitives_collection_arrays.parquet.

If you want to read an array of primitive values, such as integers or booleans, from a parquet file created by another system, you might think that you can simply use a list property in your class, like this:

class Primitives {
    public List<bool>? Booleans { get; set; }
}

IList<Primitives> data = await ParquetSerializer.DeserializeAsync<Primitives>(input);

However, this will not work, because this library expects a list of complex objects, not a list of primitives. It will throw an exception when it encounters an array in the parquet file.

To fix this problem, you need to use the [ParquetSimpleRepeatable] attribute on your list property. This tells the library that the list contains simple values that can be repeated as an array. For example:

class Primitives {
    [ParquetSimpleRepeatable]
    public List<bool>? Booleans { get; set; }
}

This will successfully deserialize the array of booleans from the parquet file into your list property.

Nested types

You can also serialize more complex types supported by the Parquet format. Sometimes you might want to store more complex data in your parquet files, like lists or maps. These are called nested types and they can be useful for organizing your information. However, they also come with a trade-off: they make your code slower and use more CPU resources. That's why you should only use them when you really need them and not just because they look cool. Simple columns are faster and easier to work with, so stick to them whenever you can.

Structures

Structures are just classes nested as members of another class, and they are completely transparent. For instance, an AddressBookEntry class may contain a structure called Address:

class Address {
    public string? Country { get; set; }
    public string? City { get; set; }
}

class AddressBookEntry {
    public string? FirstName { get; set; }
    public string? LastName { get; set; }
    public Address? Address { get; set; }
}

Populated with the following fake data:

var data = Enumerable.Range(0, 1_000_000).Select(i => new AddressBookEntry {
    FirstName = "Joe",
    LastName = "Bloggs",
    Address = new Address() {
        Country = "UK",
        City = "Unknown"
    }
}).ToList();

You can serialise/deserialise those using the same ParquetSerializer.SerializeAsync/ParquetSerializer.DeserializeAsync methods. The serializer understands nested classes and will magically traverse inside them.
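A minimal sketch of that round trip (the file path is hypothetical):

await ParquetSerializer.SerializeAsync(data, "/mnt/storage/addressbook.parquet");

IList<AddressBookEntry> back =
    await ParquetSerializer.DeserializeAsync<AddressBookEntry>("/mnt/storage/addressbook.parquet");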

Lists

One of the cool things about lists is that Parquet can handle any kind of data structure in a list. You can have a list of atoms, like 1, 2, 3, or a list of lists, [[1, 2], [3, 4], [5, 6]], or even a list of structures. Parquet.Net is awesome like that!

For instance, a simple MovementHistoryCompressed class with a PersonId and a list of ParentIds, looking like the following:

class MovementHistoryCompressed {
    public int? PersonId { get; set; }
    public List<int>? ParentIds { get; set; }
}

Is totally fine to serialise/deserialise:

var data = Enumerable.Range(0, 100).Select(i => new MovementHistoryCompressed {
    PersonId = i,
    ParentIds = Enumerable.Range(i, 4).ToList()
}).ToList();

await ParquetSerializer.SerializeAsync(data, "c:\\tmp\\lat.parquet");

Reading it in Spark produces the following schema:

root
 |-- PersonId: integer (nullable = true)
 |-- ParentIds: array (nullable = true)
 |    |-- element: integer (containsNull = true)

and data:

+--------+---------------+
|PersonId|ParentIds      |
+--------+---------------+
|0       |[0, 1, 2, 3]   |
|1       |[1, 2, 3, 4]   |
|2       |[2, 3, 4, 5]   |
|3       |[3, 4, 5, 6]   |
|4       |[4, 5, 6, 7]   |
|5       |[5, 6, 7, 8]   |
|6       |[6, 7, 8, 9]   |
|7       |[7, 8, 9, 10]  |
|8       |[8, 9, 10, 11] |
|9       |[9, 10, 11, 12]|
+--------+---------------+

Or, as a more complicated example, here is a list of structures (classes in C#):

class Address {
    public string? Country { get; set; }
    public string? City { get; set; }
}

class MovementHistory {
    public int? PersonId { get; set; }
    public string? Comments { get; set; }
    public List<Address>? Addresses { get; set; }
}

var data = Enumerable.Range(0, 1_000).Select(i => new MovementHistory {
    PersonId = i,
    Comments = i % 2 == 0 ? "none" : null,
    Addresses = Enumerable.Range(0, 4).Select(a => new Address {
        City = "Birmingham",
        Country = "United Kingdom"
    }).ToList()
}).ToList();

await ParquetSerializer.SerializeAsync(data, "c:\\tmp\\ls.parquet");

which, when read by Spark, produces the following schema:

root
 |-- PersonId: integer (nullable = true)
 |-- Comments: string (nullable = true)
 |-- Addresses: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Country: string (nullable = true)
 |    |    |-- City: string (nullable = true)

and data:

+--------+--------+--------------------+
|PersonId|Comments|           Addresses|
+--------+--------+--------------------+
|       0|    none|[{United Kingdom,...|
|       1|    null|[{United Kingdom,...|
|       2|    none|[{United Kingdom,...|
|       3|    null|[{United Kingdom,...|
|       4|    none|[{United Kingdom,...|
|       5|    null|[{United Kingdom,...|
|       6|    none|[{United Kingdom,...|
|       7|    null|[{United Kingdom,...|
|       8|    none|[{United Kingdom,...|
|       9|    null|[{United Kingdom,...|
+--------+--------+--------------------+

Maps (Dictionaries)

Maps are useful constructs if you need to serialize key-value pairs where each row can have a different number of keys. For example, if you want to store the names and hobbies of your friends, you can use a map like this:

{"Alice": ["reading", "cooking", "gardening"], "Bob": ["gaming", "coding", "sleeping"], "Charlie": ["traveling"]}

Notice how Alice and Bob have three hobbies each, while Charlie has only one. A map allows you to handle this variability without wasting space or creating empty values. Of course, you could also use a list, but then you would have to remember the order of the elements and deal with missing data. A map makes your life easier by letting you access the values by their keys.

In this library, maps are represented as an instance of generic IDictionary<TKey, TValue> type.

To give you a minimal example, let's say we have the following class with two properties: Id and Tags. The Id property is an integer that can be used to identify a row or an item in a collection. The Tags property is a dictionary of strings that can store arbitrary key-value pairs. For example, the Tags property can be used to store metadata or attributes of the item:

class IdWithTags {
    public int Id { get; set; }
    public Dictionary<string, string>? Tags { get; set; }
}

You can easily use ParquetSerializer to work with this class:

var data = Enumerable.Range(0, 10).Select(i => new IdWithTags {
    Id = i,
    Tags = new Dictionary<string, string> {
        ["id"] = i.ToString(),
        ["gen"] = DateTime.UtcNow.ToString()
    }
}).ToList();

await ParquetSerializer.SerializeAsync(data, "c:\\tmp\\map.parquet");

When read by Spark, the schema looks like the following:

root
 |-- Id: integer (nullable = true)
 |-- Tags: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

And the data:

+---+-------------------------------------+
|Id |Tags                                 |
+---+-------------------------------------+
|0  |{id -> 0, gen -> 17/03/2023 13:06:04}|
|1  |{id -> 1, gen -> 17/03/2023 13:06:04}|
|2  |{id -> 2, gen -> 17/03/2023 13:06:04}|
|3  |{id -> 3, gen -> 17/03/2023 13:06:04}|
|4  |{id -> 4, gen -> 17/03/2023 13:06:04}|
|5  |{id -> 5, gen -> 17/03/2023 13:06:04}|
|6  |{id -> 6, gen -> 17/03/2023 13:06:04}|
|7  |{id -> 7, gen -> 17/03/2023 13:06:04}|
|8  |{id -> 8, gen -> 17/03/2023 13:06:04}|
|9  |{id -> 9, gen -> 17/03/2023 13:06:04}|
+---+-------------------------------------+

Supported collection types

Similar to the collection types supported by JSON, here are the collection types Parquet.Net currently supports:

Type                         | Serialization | Deserialization
-----------------------------|---------------|----------------
Single-dimensional array **  | ✔️            |
Multi-dimensional arrays *   |               |
IList<T>                     | ✔️            | **
List<T>                      | ✔️            | ✔️
IDictionary<TKey, TValue> ** |               |
Dictionary<TKey, TValue>     | ✔️            | ✔️

* Technically impossible or very hard to implement.
** Technically possible, but not implemented yet.

Appending to files

ParquetSerializer supports appending data to an existing Parquet file. This can be useful when you have multiple batches of data that need to be written to the same file.

To use this feature, you need to set the Append flag to true in the ParquetSerializerOptions object that you pass to the SerializeAsync method. This will tell the library to append the data batch to the end of the file stream instead of overwriting it. For example:

await ParquetSerializer.SerializeAsync(dataBatch, ms, new ParquetSerializerOptions { Append = true });

However, there is one caveat: you should not set the Append flag to true for the first batch of data that you write to a new file. This is because a Parquet file has a header and a footer that contain metadata about the schema and statistics of the data. If you try to append data to an empty file stream, you will get an IOException because there is no header or footer to read from. Therefore, you should always set the Append flag to false for the first batch (or not pass any options, which makes it false by default) and then switch it to true for subsequent batches. For example:

// First batch
await ParquetSerializer.SerializeAsync(dataBatch1, ms, new ParquetSerializerOptions { Append = false });

// Second batch
await ParquetSerializer.SerializeAsync(dataBatch2, ms, new ParquetSerializerOptions { Append = true });

// Third batch
await ParquetSerializer.SerializeAsync(dataBatch3, ms, new ParquetSerializerOptions { Append = true });

By following this pattern, you can easily append data to a Parquet file using ParquetSerializer.
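For instance, a minimal sketch of that pattern over an arbitrary number of batches (batches and ms are assumed to be declared elsewhere):

bool first = true;
foreach(IEnumerable<Record> batch in batches) {
    // Append must be false only for the very first write, which creates the file
    await ParquetSerializer.SerializeAsync(batch, ms,
        new ParquetSerializerOptions { Append = !first });
    first = false;
}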

Specifying row group size

Row groups are a logical division of data in a parquet file. They allow efficient filtering and scanning of data based on predicates. By default, all the class instances are serialized into a single row group, which is absolutely fine. If you need to set a custom row group size, you can specify it in ParquetSerializerOptions like so:

await ParquetSerializer.SerializeAsync(data, stream, new ParquetSerializerOptions { RowGroupSize = 10_000_000 });

Note that small row groups make parquet files very inefficient in general, so you should use this parameter only when you are absolutely sure what you are doing. For example, if you have a very large dataset that needs to be read in chunks, you might want a smaller row group size so that readers don't have to materialise too many rows at once. However, more row groups also mean more metadata overhead and a larger file, so you should balance the trade-offs carefully.

FAQ

Q. Can I specify a schema for serialisation/deserialisation?

A. If you're using a class-based approach to define your data model, you don't have to worry about providing a schema separately. The class definition itself is the schema, meaning it specifies the fields and types of your data. This makes it easier to write and maintain your code, since you only have to define your data model once and use it everywhere.

Last modified: 16 April 2024