Use native snappy compression when running Spark on Alpine

Thomas Decaux
Jul 2, 2022


Snappy is a fast compression algorithm used by Spark to write and read smaller Parquet files. Spark uses the Java library "snappy-java" from xerial:

snappy-java is a JNI-based implementation that achieves performance comparable to the native C++ version. It contains native libraries built for Windows, Mac, Linux, etc., and loads the appropriate one for your machine environment (it looks at the system properties os.name and os.arch).

If no native library for your platform is found, snappy-java will fall back to a pure-Java implementation.

On Alpine Linux, which is very common with Docker and Kubernetes, Alpine does not ship glibc by default (it uses musl), so snappy-java falls back to the pure-Java implementation, which can be broken.

Solution

1. Install native snappy

Install this package https://pkgs.alpinelinux.org/package/edge/community/x86_64/java-snappy-native
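On an Alpine-based image this can be done with apk; a minimal sketch, assuming the standard Alpine CDN mirror (on stable releases the package may live in a different repository than edge/community, so adjust the URL to your Alpine version):

```shell
# Install the pre-built native snappy JNI library from the Alpine
# community repository. The --repository flag is only needed when the
# package is not available in the repositories already configured in
# /etc/apk/repositories.
apk add --no-cache java-snappy-native \
    --repository=https://dl-cdn.alpinelinux.org/alpine/edge/community
```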

2. Tell the Spark executor to use the system library

According to the SnappyLoader source (https://github.com/xerial/snappy-java/blob/master/src/main/java/org/xerial/snappy/SnappyLoader.java), snappy-java honors the system property org.xerial.snappy.use.systemlib, which makes it load a pre-installed native library instead of the bundled one.
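Based on that property, the executor (and driver) JVMs can be pointed at the system library through Spark's extraJavaOptions. A hedged sketch via spark-submit, where the job file name is purely illustrative:

```shell
# Sketch: tell snappy-java's SnappyLoader to skip extracting its bundled
# binary and use the native library installed on the system (e.g. via apk).
spark-submit \
  --conf "spark.executor.extraJavaOptions=-Dorg.xerial.snappy.use.systemlib=true" \
  --conf "spark.driver.extraJavaOptions=-Dorg.xerial.snappy.use.systemlib=true" \
  my-job.py
```

The same properties can instead be set in spark-defaults.conf if you prefer cluster-wide configuration over per-job flags.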
