If you want to explore how to parallelize data ingestion into Elasticsearch, please have a look at this post I wrote for AWS.
It explains how to index Common Crawl metadata into Elasticsearch using the Cascading connector, reading directly from the S3 data source.
The Cascading source code is available here.
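
To give a rough feel for what parallel ingestion means in practice, here is a minimal Python sketch. It is not the Cascading job from the post: it only shows the general pattern of splitting records into batches, formatting each batch as an Elasticsearch `_bulk` body (one action line plus one document line per record), and sending the batches concurrently. The `send_bulk` function, the `commoncrawl` index name, and the sample records are hypothetical placeholders, and the sender is a stub so the sketch runs without a cluster.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def chunked(records, size):
    """Yield consecutive lists of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def to_bulk_body(index_name, records):
    """Build an NDJSON body in the Elasticsearch _bulk format:
    one action line followed by one document line per record."""
    lines = []
    for record in records:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(record))
    return "\n".join(lines) + "\n"  # a bulk body must end with a newline


def send_bulk(body):
    """Hypothetical stand-in for an HTTP POST to the cluster's _bulk endpoint.
    It only counts the documents in the body, so the sketch runs offline."""
    return len(body.splitlines()) // 2


def parallel_ingest(index_name, records, batch_size=2, workers=4):
    """Send batches concurrently and return the total number of documents sent."""
    bodies = [to_bulk_body(index_name, batch)
              for batch in chunked(records, batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(send_bulk, bodies))


# Hypothetical sample records standing in for Common Crawl metadata.
sample = [{"url": f"http://example.com/{i}", "status": 200} for i in range(5)]
total = parallel_ingest("commoncrawl", sample)
```

In a real setup, `send_bulk` would post each body to the cluster and inspect the response for per-item errors. The batch size and worker count are tuning knobs that depend on document size and cluster capacity.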