Export data from Elasticsearch to MySQL with Apache Spark
Elasticsearch is a search engine based on the Lucene library. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents.
MySQL is an open-source relational database management system.
Apache Spark is an open-source distributed general-purpose cluster-computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
Ok, now that you know these 3 things 👊 let’s play with them!
Here’s the use case:
“I have a ton of data in elasticsearch, with nested structure, and I want to export it into MySQL, of course, as fast as possible”
Why Spark?
- Spark supports Elasticsearch and MySQL very easily (you just have to provide a plugin as a JAR file)
- Spark divides the work into stages and tasks, so it can run in parallel. With Elasticsearch as the input, the “parallel unit” is a shard: Spark reads the data of all shards at the same time (1 task = 1 shard)
- Spark can create the MySQL schema for you! As a good developer, you are lazy and always looking for automatic stuff, cool!
- Spark understands Python, Java, or Scala
Apache Zeppelin
One more (very cool) technology is Apache Zeppelin:
“Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.”
Spark is seamlessly integrated, so we are gonna use Zep’ as a notebook to write our tiny Spark (Scala) program.
The (Scala) code
1. Read Elasticsearch data
Here we get a DataFrame from the “clicks” index:
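The original notebook snippet isn’t reproduced here, but a minimal sketch looks like this, assuming the elasticsearch-spark connector JAR is available (for example as a dependency of Zeppelin’s Spark interpreter) and that Elasticsearch listens on localhost:9200; the index name “clicks” and the host/port are placeholders:

```scala
import org.apache.spark.sql.SparkSession

// In a Zeppelin notebook, `spark` already exists; building it here keeps the
// sketch self-contained. The es.* settings tell the connector where to connect.
val spark = SparkSession.builder()
  .appName("es-to-mysql")
  .config("es.nodes", "localhost")   // placeholder Elasticsearch host
  .config("es.port", "9200")
  .getOrCreate()

// Read the whole "clicks" index as a DataFrame: Spark creates 1 task per shard,
// so all shards are read in parallel.
val clicksDF = spark.read
  .format("org.elasticsearch.spark.sql")
  .load("clicks")

clicksDF.printSchema()   // nested JSON objects show up as struct columns
```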
2. Flatten schema
Elasticsearch speaks JSON and loves nested structures, whereas MySQL does not. So we must flatten our data schema:
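One way to do it (a sketch, not the article’s original gist) is to walk the DataFrame schema recursively and turn every nested struct field like `user.geo.city` into a flat column like `user_geo_city`; array fields would still need their own handling (e.g. explode), which is out of scope here:

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StructField, StructType}

// Recursively replace nested struct fields with flat, underscore-named columns,
// which MySQL can store as plain table columns.
def flattenSchema(schema: StructType, prefix: String = ""): Array[Column] =
  schema.fields.flatMap {
    case StructField(name, struct: StructType, _, _) =>
      flattenSchema(struct, s"$prefix$name.")          // descend into the struct
    case StructField(name, _, _, _) =>
      Array(col(prefix + name).as((prefix + name).replace(".", "_")))
  }

val flatClicksDF = clicksDF.select(flattenSchema(clicksDF.schema): _*)
flatClicksDF.printSchema()   // only flat columns remain, ready for a SQL table
```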
3. Write to MySQL
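Finally, the flattened DataFrame can be pushed to MySQL through Spark’s JDBC writer; the URL, database, table name and credentials below are placeholders, and the MySQL Connector/J JAR must be on the classpath. Spark creates the “clicks” table and its schema for us if it does not exist yet:

```scala
import java.util.Properties

// Placeholder connection settings; requires the MySQL Connector/J driver JAR.
val props = new Properties()
props.setProperty("user", "root")
props.setProperty("password", "secret")
props.setProperty("driver", "com.mysql.cj.jdbc.Driver")

flatClicksDF.write
  .mode("overwrite")   // or "append", depending on how you want to reload data
  .jdbc("jdbc:mysql://localhost:3306/analytics", "clicks", props)
```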