At x.com analytics, we are moving our data lake from Hadoop to Kubernetes and S3-compatible Ceph storage.
On Hadoop, we used to run Djobi, a simple Java application for running Apache Spark data pipeline jobs.
Djobi was really cool. It provided:
- workflow architecture (run X jobs from the same YAML templates, etc.)
- templating (inject variables into SQL queries)
- configuration (read YAML files)
- Spark packaging (the Elasticsearch Spark output was embedded)
- monitoring / logging to Elasticsearch
- running on Hadoop YARN
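To make the templating idea concrete, here is a minimal sketch in Python of injecting variables into a SQL query. It is only an illustration, not Djobi's actual implementation: the table name, column names, dates, and the `render_query` helper are all hypothetical.

```python
from string import Template

# Hypothetical query template; table and column names are made up for illustration.
QUERY_TEMPLATE = Template(
    "SELECT page, COUNT(*) AS hits FROM $table WHERE day = '$day' GROUP BY page"
)


def render_query(table: str, day: str) -> str:
    """Substitute job parameters into the SQL template."""
    return QUERY_TEMPLATE.substitute(table=table, day=day)


# One template, several jobs: render the same query once per day.
queries = [render_query("page_views", day) for day in ("2021-01-01", "2021-01-02")]
for query in queries:
    print(query)
```

Plain string substitution like this is only reasonable when the parameters come from trusted configuration, since it does no escaping.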
But now we are on Kubernetes, and all these cool features… are provided by the cool Argo Workflows software!
For the moment, I don’t know yet whether we will abandon Djobi. This software took me months of work, and it served us well for years. But I find in
Kubernetes and Argo Workflows exactly what I wanted to do with Djobi.
So I decided to explore the possibilities of Argo Workflows with Spark. I’m a bit tired of Java, so let’s try Python (thank god, Spark supports both Java and Python, as well as other languages such as Scala and R).
For this article, the first of a long series, I will simply show how to launch a Python script that uses Spark to display the schema of a Parquet file located on HDFS.
The script will run on a local Spark instance, embedded in an Argo workflow: