Monday, March 31, 2014

Note to myself on S3Distcp


Note on using --groupBy

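Say you have four files named 01-05-2013, 02-05-2013, 01-06-2013 and 02-06-2013, i.e. named in DD-MM-YYYY format.
Use parentheses in the --groupBy regular expression to capture the part of the name that the files in a group have in common. Files whose captured text is identical are concatenated into one output file.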
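
To make the idea concrete, here is a minimal Python sketch of how a group key could be derived from a file name. It assumes that the key is the concatenation of all captured groups (which is consistent with the output names seen further down in this post) and that the pattern has to match the whole name. The helper name group_key is made up for illustration; it is not part of S3DistCp.

```python
import re

def group_key(pattern, name):
    """Return the concatenated capture groups, or None if the name does not match.

    Assumption: the pattern must match the whole name, and the key is every
    captured group joined together.
    """
    m = re.fullmatch(pattern, name)
    if m is None:
        return None
    return "".join(g for g in m.groups() if g is not None)

# The month digit is inside the parentheses, so it becomes part of the key.
print(group_key(r".*(-0.-2013).*", "01-05-2013"))    # -05-2013
# The month digit is matched by the '.' between the groups, so it is dropped.
print(group_key(r".*(-0).(-2013).*", "01-05-2013"))  # -0-2013
# A name that does not match the pattern gets no key at all.
print(group_key(r".*(-0.-2013).*", "01-11-2013"))    # None
```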

Piping to one file

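To concatenate all the files into one, use --groupBy .*(-0).(-2013).* as the pattern. The month digit is matched by the . between the two groups, so it is not part of the captured text and all four files fall into the same group.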
hadoop@ip-172-31-34-171:~$ hadoop fs -ls /test/3
Found 1 items
-rw-r--r-- 3 hadoop supergroup 44 2013-12-27 22:53 /test/3/-0-20130   --> only one file
hadoop@ip-172-31-34-171:~$ hadoop fs -cat /test/3/-0-20130
1
2
3
4
5
6
10
11
6
7
12
13
8
9
14
15
17
18

Piping to different files for each group

To concatenate the files separately for each month, put the month digit inside the parentheses: --groupBy,.*(-0.-2013).* (written here in the comma-separated form that --args expects).
 ./elastic-mapreduce --jobflow j-14HZLRKA2CUZ7  --jar s3n://arun-emr-files/emr-s3distcp-1.0.jar --args '--src,s3n://arum-test-bucket/,--dest,hdfs:///test/2,--groupBy,.*(-0.-2013).*'
hadoop@ip-172-31-34-171:~$ hadoop fs -ls /test/2
Found 2 items
-rw-r--r-- 3 hadoop supergroup 20 2013-12-27 22:50 /test/2/-05-20130
-rw-r--r-- 3 hadoop supergroup 24 2013-12-27 22:50 /test/2/-06-20130
hadoop@ip-172-31-34-171:~$ hadoop fs -cat /test/1/-05-20130
1
2
3
4
5
6
6
7
8
9
hadoop@ip-172-31-34-171:~$ hadoop fs -cat /test/1/-06-20130
10
11
12
13
14
15
17
18
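
The difference between the two patterns can be checked offline with a small Python sketch that only simulates the grouping step (it does not copy or concatenate anything). It assumes the same rule as above: the group key is the concatenation of the captured groups, matched against the bare file name. The function name group_files is hypothetical.

```python
import re
from collections import defaultdict

def group_files(pattern, names):
    """Map each group key to the list of names that produce it (simulation only)."""
    groups = defaultdict(list)
    for name in names:
        m = re.fullmatch(pattern, name)
        if m:  # names that do not match are left out of every group
            key = "".join(g for g in m.groups() if g is not None)
            groups[key].append(name)
    return dict(groups)

files = ["01-05-2013", "02-05-2013", "01-06-2013", "02-06-2013"]

one_file = group_files(r".*(-0).(-2013).*", files)   # month digit outside the groups
per_month = group_files(r".*(-0.-2013).*", files)    # month digit inside the group

print(sorted(one_file))        # ['-0-2013']
print(sorted(per_month))       # ['-05-2013', '-06-2013']
print(per_month["-05-2013"])   # ['01-05-2013', '02-05-2013']
```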

Notes from AWS support


Wildcards can be used inside the parentheses. Whatever text they match becomes the group key, so each distinct match is a potential output file.

For example, say you have the filenames:
hosta-subprocess1-2013-06-01.log 
hosta-subprocess2-2013-06-02.log
hostb-subprocess1-2013-06-01.log
hostb-subprocess2-2013-06-02.log
hostc-subprocess1-2013-06-01.log

Using group by .*(subprocess).* would concatenate all the files into a single file, because the captured text is the literal subprocess for every name.

Using group by (.*subprocess).* would result in 3 files:
hosta-subprocess (which includes hosta-subprocess1-2013-06-01.log and hosta-subprocess2-2013-06-02.log)
hostb-subprocess (which includes hostb-subprocess1-2013-06-01.log and hostb-subprocess2-2013-06-02.log)
hostc-subprocess (only hostc-subprocess1-2013-06-01.log)

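Using group by .*(\d+-\d+-\d+).* would result in 2 files, one per date:
2013-06-01 (which includes hosta-subprocess1-2013-06-01.log, hostb-subprocess1-2013-06-01.log, hostc-subprocess1-2013-06-01.log)
2013-06-02 (which includes hosta-subprocess2-2013-06-02.log and hostb-subprocess2-2013-06-02.log)
Note: in a standard regex engine the leading .* is greedy, so this pattern captures only 3-06-01 and 3-06-02. The files still split into the same two groups, but the group names may come out truncated. A stricter pattern such as .*(\d{4}-\d{2}-\d{2}).* captures the full date.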
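
The three patterns above can be tried against the example names with a short Python sketch. It only lists the distinct captured texts per pattern, on the assumption that each distinct capture corresponds to one output file. Python's re module is used as a stand-in for the Java regex engine; the two behave the same way for these simple patterns. The helper name distinct_keys is made up for this example.

```python
import re

names = [
    "hosta-subprocess1-2013-06-01.log",
    "hosta-subprocess2-2013-06-02.log",
    "hostb-subprocess1-2013-06-01.log",
    "hostb-subprocess2-2013-06-02.log",
    "hostc-subprocess1-2013-06-01.log",
]

def distinct_keys(pattern):
    """Return the sorted set of texts captured by group 1 across all names."""
    keys = set()
    for name in names:
        m = re.fullmatch(pattern, name)
        if m:
            keys.add(m.group(1))
    return sorted(keys)

print(distinct_keys(r".*(subprocess).*"))
# ['subprocess']
print(distinct_keys(r"(.*subprocess).*"))
# ['hosta-subprocess', 'hostb-subprocess', 'hostc-subprocess']
print(distinct_keys(r".*(\d+-\d+-\d+).*"))
# ['3-06-01', '3-06-02']  (the greedy leading .* swallows most of the year)
print(distinct_keys(r".*(\d{4}-\d{2}-\d{2}).*"))
# ['2013-06-01', '2013-06-02']
```

The number of groups is the same for the last two patterns; only the captured text differs, which matters if the output file names are derived from it.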
