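Note on using --groupBy with S3DistCp
Say you have four files named 01-05-2013, 02-05-2013, 01-06-2013 and 02-06-2013, i.e. named in DD-MM-YYYY format.
In the --groupBy regular expression, use parentheses to capture the part of the file name that is common to the files you want grouped together. Files whose captured text is identical are concatenated into one output file.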
Piping to one file
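If you want to concatenate all files into one, use --groupBy .*(-0).(-2013).* so that the captured parts (-0 and -2013) are the same for every file; the month digit is matched by the "." outside the parentheses and therefore does not affect the grouping.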
./elastic-mapreduce --jobflow j-14HZLRKA2CUZ7 --jar s3n://arun-emr-files/emr-s3distcp-1.0.jar --args '--src,s3n://arum-test-bucket/,--dest,hdfs:///test/3,--groupBy,.*(-0).(-2013).*'
hadoop@ip-172-31-34-171:~$ hadoop fs -ls /test/3
Found 1 items
-rw-r--r-- 3 hadoop supergroup 44 2013-12-27 22:53 /test/3/-0-20130 --> only one file
hadoop@ip-172-31-34-171:~$ hadoop fs -cat /test/3/-0-20130
1
2
3
4
5
6
10
11
6
7
12
13
8
9
14
15
17
18
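The grouping above can be reproduced locally with a minimal Python sketch. It assumes (this is not confirmed by the note) that the pattern must match the whole name and that the group key is the concatenation of all capture groups; `group_key` is a hypothetical helper, not part of S3DistCp.

```python
import re

def group_key(pattern, filename):
    """Concatenate the capture groups of a full match (assumed grouping rule)."""
    m = re.fullmatch(pattern, filename)
    if m is None:
        return None  # assumption: a file that does not match is not grouped
    return "".join(g or "" for g in m.groups())

files = ["01-05-2013", "02-05-2013", "01-06-2013", "02-06-2013"]

# The month digit is matched by "." outside the parentheses,
# so every file yields the same key and ends up in one group.
keys = {group_key(r".*(-0).(-2013).*", f) for f in files}
print(keys)  # {'-0-2013'}
```

The single key matches the single output file listed above, whose name starts with -0-2013.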
Piping to different files for each group
If you want a separate concatenated file for each month, use --groupBy .*(-0.-2013).* so that the month digit falls inside the parentheses and becomes part of the group name.
./elastic-mapreduce --jobflow j-14HZLRKA2CUZ7 --jar s3n://arun-emr-files/emr-s3distcp-1.0.jar --args '--src,s3n://arum-test-bucket/,--dest,hdfs:///test/2,--groupBy,.*(-0.-2013).*'
hadoop@ip-172-31-34-171:~$ hadoop fs -ls /test/2
Found 2 items
-rw-r--r-- 3 hadoop supergroup 20 2013-12-27 22:50 /test/2/-05-20130
-rw-r--r-- 3 hadoop supergroup 24 2013-12-27 22:50 /test/2/-06-20130
hadoop@ip-172-31-34-171:~$ hadoop fs -cat /test/1/-05-20130
1
2
3
4
5
6
6
7
8
9
hadoop@ip-172-31-34-171:~$ hadoop fs -cat /test/1/-06-20130
10
11
12
13
14
15
17
18
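As a rough illustration of why this pattern yields two output files, the following Python sketch buckets the four file names by their captured text. It assumes a whole-name match and a key built from the concatenated capture groups; `group_files` is a hypothetical helper written for this note, not an S3DistCp API.

```python
import re
from collections import defaultdict

def group_files(pattern, filenames):
    """Bucket file names by their concatenated capture groups (assumed rule)."""
    regex = re.compile(pattern)
    groups = defaultdict(list)
    for name in filenames:
        m = regex.fullmatch(name)
        if m:  # assumption: non-matching names are left out of every group
            groups["".join(g or "" for g in m.groups())].append(name)
    return dict(groups)

files = ["01-05-2013", "02-05-2013", "01-06-2013", "02-06-2013"]

# The month digit is now inside the parentheses, so May and June differ.
by_month = group_files(r".*(-0.-2013).*", files)
for key in sorted(by_month):
    print(key, by_month[key])
# -05-2013 ['01-05-2013', '02-05-2013']
# -06-2013 ['01-06-2013', '02-06-2013']
```

The two keys correspond to the two output files in the listing, one per month.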
Notes from AWS support
Wildcards can be used inside the parentheses, and whatever they match determines the potential groups.
For example, say you have the filenames:
hosta-subprocess1-2013-06-01.log
hosta-subprocess2-2013-06-02.log
hostb-subprocess1-2013-06-01.log
hostb-subprocess2-2013-06-02.log
hostc-subprocess1-2013-06-01.log
Using --groupBy .*(subprocess).* would result in a concatenation of all files into a single file, because the captured text (subprocess) is the same for every file.
Using group by (.*subprocess).* would result in 3 files:
hosta-subprocess (which includes hosta-subprocess1-2013-06-01.log and hosta-subprocess2-2013-06-02.log)
hostb-subprocess (which includes hostb-subprocess1-2013-06-01.log and hostb-subprocess2-2013-06-02.log)
hostc-subprocess (only hostc-subprocess1-2013-06-01.log)
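Using --groupBy .*(\d+-\d+-\d+).* would result in 2 files:
2013-06-01 (which includes hosta-subprocess1-2013-06-01.log, hostb-subprocess1-2013-06-01.log, hostc-subprocess1-2013-06-01.log)
2013-06-02 (which includes hosta-subprocess2-2013-06-02.log and hostb-subprocess2-2013-06-02.log)
Note: because the leading .* is greedy, a standard regex engine captures only the last digit of the year with this pattern (3-06-01 and 3-06-02). The files are still split into the same two groups, but the group names may be shortened; a pattern such as .*(\d{4}-\d{2}-\d{2}).* captures the full date.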
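The examples from AWS support can be checked locally with a short Python sketch. It assumes whole-name matching and a group key made of the concatenated capture groups. `group_files` is a hypothetical helper, and the final fixed-width date pattern is an addition for illustration, not part of the support notes. Python's `re` is used here; greedy matching behaves the same way for these patterns in other common regex engines.

```python
import re
from collections import defaultdict

def group_files(pattern, filenames):
    """Bucket file names by their concatenated capture groups (assumed rule)."""
    regex = re.compile(pattern)
    groups = defaultdict(list)
    for name in filenames:
        m = regex.fullmatch(name)
        if m:
            groups["".join(g or "" for g in m.groups())].append(name)
    return dict(groups)

logs = [
    "hosta-subprocess1-2013-06-01.log",
    "hosta-subprocess2-2013-06-02.log",
    "hostb-subprocess1-2013-06-01.log",
    "hostb-subprocess2-2013-06-02.log",
    "hostc-subprocess1-2013-06-01.log",
]

patterns = [
    r".*(subprocess).*",           # one group for everything
    r"(.*subprocess).*",           # one group per host
    r".*(\d+-\d+-\d+).*",          # one group per day (greedy prefix)
    r".*(\d{4}-\d{2}-\d{2}).*",    # one group per day, full date captured
]

for p in patterns:
    print(p, sorted(group_files(p, logs)))
# .*(subprocess).* ['subprocess']
# (.*subprocess).* ['hosta-subprocess', 'hostb-subprocess', 'hostc-subprocess']
# .*(\d+-\d+-\d+).* ['3-06-01', '3-06-02']
# .*(\d{4}-\d{2}-\d{2}).* ['2013-06-01', '2013-06-02']
```

Running a candidate pattern through a check like this before submitting the job step is a cheap way to confirm how many output files to expect.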