Use native Snappy compression when running Spark on Alpine
Snappy is a fast compression algorithm used by Spark to write and read smaller Parquet files. Spark uses “snappy-java”, a JNI-based Java library from xerial that achieves performance comparable to the native C++ version. snappy-java bundles native libraries built for Windows, macOS, Linux, etc., and loads the one matching your machine environment (it inspects the system properties os.name and os.arch). If no native library for your platform is found, snappy-java falls back to a pure-Java implementation.
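For illustration, here is a minimal Java sketch that prints the two properties the loader inspects and then forces the loader to run with a compress/uncompress round trip (the class name SnappyCheck is made up for this example):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.xerial.snappy.Snappy;

// SnappyCheck is a hypothetical name used only for this sketch.
public class SnappyCheck {
    public static void main(String[] args) throws IOException {
        // snappy-java picks its bundled native binary from these two properties.
        System.out.println("os.name = " + System.getProperty("os.name")); // e.g. Linux
        System.out.println("os.arch = " + System.getProperty("os.arch")); // e.g. amd64

        // The first call into Snappy triggers the library loader.
        byte[] compressed = Snappy.compress("hello snappy".getBytes(StandardCharsets.UTF_8));
        byte[] restored = Snappy.uncompress(compressed);
        System.out.println(new String(restored, StandardCharsets.UTF_8)); // hello snappy
    }
}
```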
Alpine Linux, very common with Docker and Kubernetes, does not ship glibc by default (it uses musl), so the bundled native library cannot be loaded and snappy-java falls back to the pure-Java implementation, which can be broken.
Solution
1. Install native snappy
Install the java-snappy-native package: https://pkgs.alpinelinux.org/package/edge/community/x86_64/java-snappy-native
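For example, a minimal Dockerfile sketch, assuming an Alpine-based image whose apk configuration includes the community repository that provides this package:

```dockerfile
# Assumption: the community repository is enabled for this Alpine version.
# java-snappy-native provides the JNI bindings to the system snappy library.
RUN apk add --no-cache java-snappy-native
```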
2. Tell the Spark executors to use the system library
According to https://github.com/xerial/snappy-java/blob/master/src/main/java/org/xerial/snappy/SnappyLoader.java, the loader honors the org.xerial.snappy.use.systemlib system property: when it is set to true, snappy-java loads the system-installed library instead of extracting its bundled one.
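A sketch of how this could be passed to the driver and executors via spark-defaults.conf (spark.driver.extraJavaOptions and spark.executor.extraJavaOptions are standard Spark settings; whether the driver line is needed depends on where compression happens in your job):

```properties
# Ask snappy-java to load the system library instead of its bundled binary.
spark.driver.extraJavaOptions   -Dorg.xerial.snappy.use.systemlib=true
spark.executor.extraJavaOptions -Dorg.xerial.snappy.use.systemlib=true
```

With this property set, snappy-java resolves the library through System.loadLibrary, so the .so installed by java-snappy-native must be discoverable via the JVM's java.library.path; if the loader cannot find it, you may also need to append -Djava.library.path pointing at the install location (e.g. /usr/lib) to the same options.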