These are two examples of how to copy data from one S3 location to another S3 location. The same operation can also be done from S3 to HDFS and vice versa.
This assumes you are able to launch the Hive client or the spark-shell client.
Hive:
Using the MapReduce engine or the Tez engine:
set hive.execution.engine=mr;
or
set hive.execution.engine=tez;
CREATE EXTERNAL TABLE source_table(a_col string, b_col string, c_col string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION 's3://mybucket/hive/csv/';
CREATE TABLE destination_table(a_col string, b_col string, c_col string) LOCATION 's3://mybucket/output-hive/csv_1/';
INSERT OVERWRITE TABLE destination_table SELECT * FROM source_table;
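To confirm the copy finished as expected, a quick row-count comparison can be run against the two tables defined above:
SELECT COUNT(*) FROM source_table;
SELECT COUNT(*) FROM destination_table;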
Spark:
sc.textFile("s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-48/segments/1387346051826/warc/").saveAsTextFile("s3://mybucket/spark/bigfiles")
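On Spark 2.x and later, the same copy can also be sketched with the DataFrame API instead of the RDD API, assuming spark is the SparkSession created by spark-shell and using the same example paths:
// Read the source files as plain text and write them back out unchanged
spark.read.text("s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-48/segments/1387346051826/warc/")
  .write.mode("overwrite")
  .text("s3://mybucket/spark/bigfiles")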
If you want to copy data to HDFS, you can also explore s3-dist-cp:
s3DistCP:
s3-dist-cp --src s3://mybucket/hive/csv/ --dest hdfs:///output-hive/csv_10/
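s3-dist-cp also works in the opposite direction; for example, to copy the HDFS output back to S3 (hypothetical destination path):
s3-dist-cp --src hdfs:///output-hive/csv_10/ --dest s3://mybucket/output-hive/csv_10_copy/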