Export data from elasticsearch to mysql with Apache Spark

Thomas Decaux
2 min readFeb 20, 2020

--

Elasticsearch is a search engine based on the Lucene library. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents.

MySQL is an open-source relational database management system.

Apache Spark is an open-source distributed general-purpose cluster-computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

Ok, now, you know these 3 things 👊 let’s play with it!

Here the use case:

“I have a ton of data in elasticsearch, with nested structure, and I want to export it into MySQL, of course, as fast as possible”

Why Spark?

  • Spark supports elasticsearch and MySQL very easily (just a plugin as a JAR file to provide)
  • Spark divides the work as stages/tasks, so it can work in parallel, with elasticsearch as the input, the “parallel unit” will be a shard, Spark will read data of all shards in the same time (1 task = 1 shard)
  • Spark can create MySQL schema for you! as a good developer, you are lazy, and always looking for automatic stuff, cool!
  • Spark can understand Python, Java or Scala

Apache Zeppelin

One more (very cool) techno is Apache Zeppelin:

Web-based notebook that enables data-driven,
interactive data analytics and collaborative documents with SQL, Scala and more.

Spark is seamlessly integrated, so we are gonna use Zep’ as a note book to write our tiny Spark (Scala) program.

The (Scala) code

  1. Read elasticsearch data

Here we got a DataFrame from “clicks” index:

2. Flatten schema

Elasticsearch speaks JSON, it loves nested structure, whereas MySQL does not. So we must flat our data schema:

3. Write to MySQL

--

--

Thomas Decaux
Thomas Decaux

No responses yet