Export data from Elasticsearch to MySQL with Apache Spark
Elasticsearch is a search engine based on the Lucene library. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents.
MySQL is an open-source relational database management system.
Apache Spark is an open-source distributed general-purpose cluster-computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
Ok, now that you know these 3 things 👊 let’s play with them!
Here’s the use case:
“I have a ton of data in elasticsearch, with nested structure, and I want to export it into MySQL, of course, as fast as possible”
Why Spark?
- Spark supports Elasticsearch and MySQL very easily (you just have to provide a plugin as a JAR file)
- Spark divides the work into stages and tasks, so it can run in parallel. With Elasticsearch as the input, the “parallel unit” is a shard: Spark reads the data of all shards at the same time (1 task = 1 shard)
- Spark can create the MySQL schema for you! As a good developer, you are lazy and always looking for automatic stuff, cool!
- Spark understands Python, Java, or Scala
Apache Zeppelin
One more (very cool) technology is Apache Zeppelin:
“Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.”
Spark is seamlessly integrated, so we are gonna use Zep’ as a notebook to write our tiny Spark (Scala) program.
The (Scala) code
1. Read Elasticsearch data
Here we get a DataFrame from the “clicks” index:
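The original notebook snippet isn’t reproduced here, but a minimal sketch looks like this, assuming the elasticsearch-spark connector JAR is available (for example as a dependency of Zeppelin’s Spark interpreter) and that Elasticsearch listens on localhost:9200; the index name “clicks” and the host/port are placeholders:

```scala
import org.apache.spark.sql.SparkSession

// In a Zeppelin notebook, `spark` already exists; building it here keeps the
// sketch self-contained. The es.* settings tell the connector where to connect.
val spark = SparkSession.builder()
  .appName("es-to-mysql")
  .config("es.nodes", "localhost")   // placeholder Elasticsearch host
  .config("es.port", "9200")
  .getOrCreate()

// Read the whole "clicks" index as a DataFrame: Spark creates 1 task per shard,
// so all shards are read in parallel.
val clicksDF = spark.read
  .format("org.elasticsearch.spark.sql")
  .load("clicks")

clicksDF.printSchema()   // nested JSON objects show up as struct columns
```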
2. Flatten schema
Elasticsearch speaks JSON and loves nested structures, whereas MySQL does not. So we must flatten our data schema:
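One way to do it (a sketch, not the article’s original gist) is to walk the DataFrame schema recursively and turn every nested struct field like `user.geo.city` into a flat column like `user_geo_city`; array fields would still need their own handling (e.g. explode), which is out of scope here:

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StructField, StructType}

// Recursively replace nested struct fields with flat, underscore-named columns,
// which MySQL can store as plain table columns.
def flattenSchema(schema: StructType, prefix: String = ""): Array[Column] =
  schema.fields.flatMap {
    case StructField(name, struct: StructType, _, _) =>
      flattenSchema(struct, s"$prefix$name.")          // descend into the struct
    case StructField(name, _, _, _) =>
      Array(col(prefix + name).as((prefix + name).replace(".", "_")))
  }

val flatClicksDF = clicksDF.select(flattenSchema(clicksDF.schema): _*)
flatClicksDF.printSchema()   // only flat columns remain, ready for a SQL table
```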
3. Write to MySQL
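Finally, the flattened DataFrame can be pushed to MySQL through Spark’s JDBC writer; the URL, database, table name and credentials below are placeholders, and the MySQL Connector/J JAR must be on the classpath. Spark creates the “clicks” table and its schema for us if it does not exist yet:

```scala
import java.util.Properties

// Placeholder connection settings; requires the MySQL Connector/J driver JAR.
val props = new Properties()
props.setProperty("user", "root")
props.setProperty("password", "secret")
props.setProperty("driver", "com.mysql.cj.jdbc.Driver")

flatClicksDF.write
  .mode("overwrite")   // or "append", depending on how you want to reload data
  .jdbc("jdbc:mysql://localhost:3306/analytics", "clicks", props)
```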