Run pyspark script on kubernetes with argo-workflow

Thomas Decaux
2 min read · Nov 10, 2022

At x.com analytics, we are moving our data lake from Hadoop to kubernetes + S3 Ceph storage.

On Hadoop, we used to run Djobi, a simple Java application that runs Apache Spark data pipeline jobs.

Djobi was really cool. It provided:

  • workflow architecture (run X jobs from the same YAML templates, etc.)
  • templating (inject variables into SQL queries)
  • configuration (read YAML files)
  • spark packaging (the elasticsearch spark output was embedded)
  • monitoring / logging on elasticsearch
  • running on Hadoop YARN

But now, we are on kubernetes, and all these cool features … are provided by the cool argo-workflow software!

For the moment, I don’t know yet if we will abandon Djobi. This software took me months of work, and it served us well for years. But I find in k8s and argo-workflow exactly what I wanted to do with Djobi.

So I decided to explore the possibilities of argo-workflow with spark. I’m a bit tired of Java, so let’s try Python (thank god, Spark supports both Java and Python, as well as Scala and R).

For this article, the first of a long series, I will simply show how to launch a python script using the spark framework to display the schema of a parquet file located on HDFS.

The script runs on a local spark (local mode, no cluster), embedded in an argo workflow.
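A minimal sketch of such a script, under stated assumptions, could look like the following. The function names, the app name, the `--master` flag and the example HDFS URL are all hypothetical and chosen for illustration. The only Spark calls used are `SparkSession.builder`, `spark.read.parquet` and `DataFrame.printSchema`.

```python
import argparse
import sys


def parse_args(argv):
    """Parse the command line: one parquet path, plus an optional Spark master."""
    parser = argparse.ArgumentParser(description="Print the schema of a parquet file")
    # Hypothetical example value: hdfs://namenode:8020/data/events.parquet
    parser.add_argument("path", help="URL of the parquet file or directory")
    # local[*] runs Spark inside this single process, using all available cores
    parser.add_argument("--master", default="local[*]", help="Spark master URL")
    return parser.parse_args(argv)


def print_schema(path, master):
    """Start a Spark session, read the parquet file and print its schema."""
    # Imported here so the module can be loaded without pyspark installed
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master(master)
        .appName("print-parquet-schema")  # assumed name, pick your own
        .getOrCreate()
    )
    try:
        # Only the parquet metadata is needed to resolve the schema
        spark.read.parquet(path).printSchema()
    finally:
        spark.stop()


# Run only when a path was actually given on the command line
if __name__ == "__main__" and len(sys.argv) > 1:
    args = parse_args(sys.argv[1:])
    print_schema(args.path, args.master)
```

Saved as, say, `print_schema.py`, it would be started with `python print_schema.py hdfs://namenode:8020/data/events.parquet` inside a container image that ships pyspark and a Java runtime. In an argo workflow, one option is to put the source in the `source` field of a script template. How the path is passed in then depends on how the workflow is written.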
