  • 문과생의 백엔드 개발자 성장기
|Playdata_study/HADOOP

210716_HADOOP (MapReduce Job)

by 케리's 2021. 7. 16.

MapReduce docs

 

https://hadoop.apache.org/docs/r2.10.1/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Example:_WordCount_v2.0

 

Apache Hadoop 2.10.1 – MapReduce Tutorial

 

 

This picks up from the previous session, with file01 and file02 already uploaded under the test directory (see the 210715 post).

 

 

1. Save the wc.jar file under the test directory

 

  wc.jar can be built by downloading the source code from the docs above and exporting it as a jar file.

  (Save the Java source in Eclipse, export it as a jar, copy the jar into the shared folder, and then

   save it under the test directory from the hadoop account. Note that the commands below omit the

   class name after wc.jar, which only works if the exported jar's manifest sets Main-Class —

   here com.test.WordCount2, as the job counters show.)


2. Run the WordCount job on the input directory

 

[hadoop@hadoop01 test]$ $HADOOP_HOME/bin/hadoop jar wc.jar /user/joe/wordcount/input /user/joe/wordcount/output

 

Execution output:

[hadoop@hadoop01 test]$ $HADOOP_HOME/bin/hadoop jar wc.jar /user/joe/wordcount/input /user/joe/wordcount/output
21/07/16 20:24:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/07/16 20:24:33 INFO client.RMProxy: Connecting to ResourceManager at hadoop01/192.168.56.116:8040
21/07/16 20:24:37 INFO input.FileInputFormat: Total input files to process : 2
21/07/16 20:24:37 INFO mapreduce.JobSubmitter: number of splits:2
21/07/16 20:24:38 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1626432687117_0001
21/07/16 20:24:38 INFO conf.Configuration: resource-types.xml not found
21/07/16 20:24:38 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
21/07/16 20:24:38 INFO resource.ResourceUtils: Adding resource type - name = memory-mb, units = Mi, type = COUNTABLE
21/07/16 20:24:38 INFO resource.ResourceUtils: Adding resource type - name = vcores, units = , type = COUNTABLE
21/07/16 20:24:39 INFO impl.YarnClientImpl: Submitted application application_1626432687117_0001
21/07/16 20:24:39 INFO mapreduce.Job: The url to track the job: http://hadoop01:8088/proxy/application_1626432687117_0001/
21/07/16 20:24:39 INFO mapreduce.Job: Running job: job_1626432687117_0001
21/07/16 20:24:59 INFO mapreduce.Job: Job job_1626432687117_0001 running in uber mode : false
21/07/16 20:24:59 INFO mapreduce.Job:  map 0% reduce 0%
21/07/16 20:25:15 INFO mapreduce.Job:  map 100% reduce 0%
21/07/16 20:25:30 INFO mapreduce.Job:  map 100% reduce 100%
21/07/16 20:25:30 INFO mapreduce.Job: Job job_1626432687117_0001 completed successfully
21/07/16 20:25:30 INFO mapreduce.Job: Counters: 50
	File System Counters
		FILE: Number of bytes read=117
		FILE: Number of bytes written=625427
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=292
		HDFS: Number of bytes written=67
		HDFS: Number of read operations=9
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters 
		Launched map tasks=2
		Launched reduce tasks=1
		Data-local map tasks=2
		Total time spent by all maps in occupied slots (ms)=27787
		Total time spent by all reduces in occupied slots (ms)=10943
		Total time spent by all map tasks (ms)=27787
		Total time spent by all reduce tasks (ms)=10943
		Total vcore-milliseconds taken by all map tasks=27787
		Total vcore-milliseconds taken by all reduce tasks=10943
		Total megabyte-milliseconds taken by all map tasks=28453888
		Total megabyte-milliseconds taken by all reduce tasks=11205632
	Map-Reduce Framework
		Map input records=3
		Map output records=9
		Map output bytes=93
		Map output materialized bytes=123
		Input split bytes=234
		Combine input records=9
		Combine output records=9
		Reduce input groups=8
		Reduce shuffle bytes=123
		Reduce input records=9
		Reduce output records=8
		Spilled Records=18
		Shuffled Maps =2
		Failed Shuffles=0
		Merged Map outputs=2
		GC time elapsed (ms)=570
		CPU time spent (ms)=2360
		Physical memory (bytes) snapshot=546463744
		Virtual memory (bytes) snapshot=6203805696
		Total committed heap usage (bytes)=301146112
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	com.test.WordCount2$TokenizerMapper$CountersEnum
		INPUT_WORDS=9
	File Input Format Counters 
		Bytes Read=58
	File Output Format Counters 
		Bytes Written=67

 

3. Read the output file produced by the MR job

 

[hadoop@hadoop01 test]$ $HADOOP_HOME/bin/hadoop fs -cat /user/joe/wordcount/output/part-r-00000

 

Result:

 

[hadoop@hadoop01 test]$ $HADOOP_HOME/bin/hadoop fs -cat /user/joe/wordcount/output/part-r-00000
21/07/16 20:26:35 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Bye	1
Goodbye	1
Hadoop,	1
Hello	2
World!	1
World,	1
hadoop.	1
to	1
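The reduce output above is just a count of whitespace-separated tokens in sorted order (which is why "World," and "World!" are counted separately). A minimal, Hadoop-free sketch of the same counting — the class name is hypothetical, and the two input lines are assumed to match file01/file02 from the 210715 post, since their token counts match the output above:

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class WordCountSketch {
    public static Map<String, Integer> count(String... lines) {
        // TreeMap keeps keys in natural String order, like the sorted reducer output
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            // StringTokenizer splits on whitespace only, so punctuation stays attached
            StringTokenizer it = new StringTokenizer(line);
            while (it.hasMoreTokens()) {
                counts.merge(it.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> c = count("Hello World, Bye World!",
                                       "Hello Hadoop, Goodbye to hadoop.");
        c.forEach((w, n) -> System.out.println(w + "\t" + n));
    }
}
```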

 

4. Create pattern.txt listing the unwanted tokens ( , . ! to) to strip from the output above

 


1) Create /home/hadoop/test/pattern.txt with vi, enter the following, and save:

 

\.
\,
\!
to
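The tutorial's mapper treats each line of this file as a Java regular expression and deletes every match from the input line with replaceAll before tokenizing (which is why the dot, comma, and bang must be escaped). A small standalone sketch of that step — the class name is hypothetical:

```java
import java.util.List;

public class PatternStrip {
    // Apply each skip pattern as a regex and remove all matches,
    // mirroring the replaceAll loop in WordCount2's map()
    public static String strip(String line, List<String> patterns) {
        for (String p : patterns) {
            line = line.replaceAll(p, "");
        }
        return line;
    }

    public static void main(String[] args) {
        // The same four patterns as pattern.txt, as Java string literals
        List<String> patterns = List.of("\\.", "\\,", "\\!", "to");
        System.out.println(strip("Hello World, Bye World!", patterns));
        // -> Hello World Bye World
    }
}
```

Note that a bare pattern like `to` removes that substring anywhere it appears, not just standalone words.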

 

 

 

5. Upload pattern.txt to HDFS (if the put fails, upload it directly through the NameNode web UI on port 50070).

   Note: the -skip option in the later step references /user/joe/wordcount/pattern.txt; if the file only

   exists under input/, adjust the path — a pattern file sitting inside the input directory would also be

   picked up as job input.

 

 

[hadoop@hadoop01 test]$ $HADOOP_HOME/bin/hadoop fs -put pattern.txt /user/joe/wordcount/input

 

 

6. Read the uploaded pattern.txt file

 

[hadoop@hadoop01 test]$ $HADOOP_HOME/bin/hadoop fs -cat /user/joe/wordcount/input/pattern.txt

 

Execution output:

 

[hadoop@hadoop01 test]$ $HADOOP_HOME/bin/hadoop fs -cat /user/joe/wordcount/input/pattern.txt
21/07/16 20:31:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
\.
\,
\!
to

 

 

 

7. Run MR on the sentences in the file01 and file02 text files, first counting

   with case sensitivity, then converting everything to lowercase before counting

   (using the -Dwordcount.case.sensitive option).
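The toggle is read in the mapper's setup() roughly as conf.getBoolean("wordcount.case.sensitive", true): when false, each line is lowercased before the skip patterns are applied. A Hadoop-free sketch of the whole map path (lowercase toggle, then pattern strip, then tokenize and count) — the class name is hypothetical and the input lines are assumed from the earlier posts, but the results match both part-r-00000 listings in this step:

```java
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class CaseToggleSketch {
    public static Map<String, Integer> count(boolean caseSensitive,
                                             List<String> patterns,
                                             String... lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            if (!caseSensitive) {
                line = line.toLowerCase();  // wordcount.case.sensitive=false path
            }
            for (String p : patterns) {
                line = line.replaceAll(p, "");  // skip patterns from pattern.txt
            }
            StringTokenizer it = new StringTokenizer(line);
            while (it.hasMoreTokens()) {
                counts.merge(it.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> patterns = List.of("\\.", "\\,", "\\!", "to");
        System.out.println(count(true, patterns,
                "Hello World, Bye World!", "Hello Hadoop, Goodbye to hadoop."));
        System.out.println(count(false, patterns,
                "Hello World, Bye World!", "Hello Hadoop, Goodbye to hadoop."));
    }
}
```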

 

 

1) Count with case sensitivity (= true)

 

[hadoop@hadoop01 test]$ $HADOOP_HOME/bin/hadoop jar wc.jar -Dwordcount.case.sensitive=true /user/joe/wordcount/input /user/joe/wordcount/output1 -skip /user/joe/wordcount/pattern.txt

 

[hadoop@hadoop01 test]$ $HADOOP_HOME/bin/hadoop fs -cat /user/joe/wordcount/output1/part-r-00000

 

Execution output:

 

[hadoop@hadoop01 test]$ $HADOOP_HOME/bin/hadoop jar wc.jar -Dwordcount.case.sensitive=true /user/joe/wordcount/input /user/joe/wordcount/output1 -skip /user/joe/wordcount/pattern.txt
21/07/16 20:43:17 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/07/16 20:43:19 INFO client.RMProxy: Connecting to ResourceManager at hadoop01/192.168.56.116:8040
21/07/16 20:43:22 INFO input.FileInputFormat: Total input files to process : 2
21/07/16 20:43:22 INFO mapreduce.JobSubmitter: number of splits:2
21/07/16 20:43:23 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1626432687117_0004
21/07/16 20:43:23 INFO conf.Configuration: resource-types.xml not found
21/07/16 20:43:23 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
21/07/16 20:43:23 INFO resource.ResourceUtils: Adding resource type - name = memory-mb, units = Mi, type = COUNTABLE
21/07/16 20:43:23 INFO resource.ResourceUtils: Adding resource type - name = vcores, units = , type = COUNTABLE
21/07/16 20:43:24 INFO impl.YarnClientImpl: Submitted application application_1626432687117_0004
21/07/16 20:43:24 INFO mapreduce.Job: The url to track the job: http://hadoop01:8088/proxy/application_1626432687117_0004/
21/07/16 20:43:24 INFO mapreduce.Job: Running job: job_1626432687117_0004
21/07/16 20:43:38 INFO mapreduce.Job: Job job_1626432687117_0004 running in uber mode : false
21/07/16 20:43:38 INFO mapreduce.Job:  map 0% reduce 0%
21/07/16 20:43:59 INFO mapreduce.Job:  map 100% reduce 0%
21/07/16 20:44:11 INFO mapreduce.Job:  map 100% reduce 100%
21/07/16 20:44:12 INFO mapreduce.Job: Job job_1626432687117_0004 completed successfully
21/07/16 20:44:12 INFO mapreduce.Job: Counters: 50
	File System Counters
		FILE: Number of bytes read=92
		FILE: Number of bytes written=629259
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=292
		HDFS: Number of bytes written=50
		HDFS: Number of read operations=9
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters 
		Launched map tasks=2
		Launched reduce tasks=1
		Data-local map tasks=2
		Total time spent by all maps in occupied slots (ms)=36721
		Total time spent by all reduces in occupied slots (ms)=8984
		Total time spent by all map tasks (ms)=36721
		Total time spent by all reduce tasks (ms)=8984
		Total vcore-milliseconds taken by all map tasks=36721
		Total vcore-milliseconds taken by all reduce tasks=8984
		Total megabyte-milliseconds taken by all map tasks=37602304
		Total megabyte-milliseconds taken by all reduce tasks=9199616
	Map-Reduce Framework
		Map input records=3
		Map output records=8
		Map output bytes=82
		Map output materialized bytes=98
		Input split bytes=234
		Combine input records=8
		Combine output records=7
		Reduce input groups=6
		Reduce shuffle bytes=98
		Reduce input records=7
		Reduce output records=6
		Spilled Records=14
		Shuffled Maps =2
		Failed Shuffles=0
		Merged Map outputs=2
		GC time elapsed (ms)=790
		CPU time spent (ms)=4120
		Physical memory (bytes) snapshot=598175744
		Virtual memory (bytes) snapshot=6201077760
		Total committed heap usage (bytes)=301146112
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	com.test.WordCount2$TokenizerMapper$CountersEnum
		INPUT_WORDS=8
	File Input Format Counters 
		Bytes Read=58
	File Output Format Counters 
		Bytes Written=50

 

[hadoop@hadoop01 test]$ $HADOOP_HOME/bin/hadoop fs -cat /user/joe/wordcount/output1/part-r-00000
21/07/16 20:49:17 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Bye	1
Goodbye	1
Hadoop	1
Hello	2
World	2
hadoop	1

 

2) Convert everything to lowercase before counting (= false)

 

[hadoop@hadoop01 test]$ $HADOOP_HOME/bin/hadoop jar wc.jar -Dwordcount.case.sensitive=false /user/joe/wordcount/input /user/joe/wordcount/output2 -skip /user/joe/wordcount/pattern.txt
[hadoop@hadoop01 test]$ $HADOOP_HOME/bin/hadoop fs -cat /user/joe/wordcount/output2/part-r-00000

 

Execution output:

[hadoop@hadoop01 test]$ $HADOOP_HOME/bin/hadoop jar wc.jar -Dwordcount.case.sensitive=false /user/joe/wordcount/input /user/joe/wordcount/output2 -skip /user/joe/wordcount/pattern.txt
21/07/16 20:51:19 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/07/16 20:51:21 INFO client.RMProxy: Connecting to ResourceManager at hadoop01/192.168.56.116:8040
21/07/16 20:51:25 INFO input.FileInputFormat: Total input files to process : 2
21/07/16 20:51:25 INFO mapreduce.JobSubmitter: number of splits:2
21/07/16 20:51:25 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1626432687117_0005
21/07/16 20:51:26 INFO conf.Configuration: resource-types.xml not found
21/07/16 20:51:26 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
21/07/16 20:51:26 INFO resource.ResourceUtils: Adding resource type - name = memory-mb, units = Mi, type = COUNTABLE
21/07/16 20:51:26 INFO resource.ResourceUtils: Adding resource type - name = vcores, units = , type = COUNTABLE
21/07/16 20:51:26 INFO impl.YarnClientImpl: Submitted application application_1626432687117_0005
21/07/16 20:51:26 INFO mapreduce.Job: The url to track the job: http://hadoop01:8088/proxy/application_1626432687117_0005/
21/07/16 20:51:26 INFO mapreduce.Job: Running job: job_1626432687117_0005
21/07/16 20:51:40 INFO mapreduce.Job: Job job_1626432687117_0005 running in uber mode : false
21/07/16 20:51:40 INFO mapreduce.Job:  map 0% reduce 0%
21/07/16 20:51:56 INFO mapreduce.Job:  map 100% reduce 0%
21/07/16 20:52:07 INFO mapreduce.Job:  map 100% reduce 100%
21/07/16 20:52:07 INFO mapreduce.Job: Job job_1626432687117_0005 completed successfully
21/07/16 20:52:07 INFO mapreduce.Job: Counters: 50
	File System Counters
		FILE: Number of bytes read=79
		FILE: Number of bytes written=629236
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=292
		HDFS: Number of bytes written=41
		HDFS: Number of read operations=9
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters 
		Launched map tasks=2
		Launched reduce tasks=1
		Data-local map tasks=2
		Total time spent by all maps in occupied slots (ms)=25850
		Total time spent by all reduces in occupied slots (ms)=8221
		Total time spent by all map tasks (ms)=25850
		Total time spent by all reduce tasks (ms)=8221
		Total vcore-milliseconds taken by all map tasks=25850
		Total vcore-milliseconds taken by all reduce tasks=8221
		Total megabyte-milliseconds taken by all map tasks=26470400
		Total megabyte-milliseconds taken by all reduce tasks=8418304
	Map-Reduce Framework
		Map input records=3
		Map output records=8
		Map output bytes=82
		Map output materialized bytes=85
		Input split bytes=234
		Combine input records=8
		Combine output records=6
		Reduce input groups=5
		Reduce shuffle bytes=85
		Reduce input records=6
		Reduce output records=5
		Spilled Records=12
		Shuffled Maps =2
		Failed Shuffles=0
		Merged Map outputs=2
		GC time elapsed (ms)=633
		CPU time spent (ms)=2500
		Physical memory (bytes) snapshot=571346944
		Virtual memory (bytes) snapshot=6201217024
		Total committed heap usage (bytes)=301146112
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	com.test.WordCount2$TokenizerMapper$CountersEnum
		INPUT_WORDS=8
	File Input Format Counters 
		Bytes Read=58
	File Output Format Counters 
		Bytes Written=41

 

[hadoop@hadoop01 test]$ $HADOOP_HOME/bin/hadoop fs -cat /user/joe/wordcount/output2/part-r-00000
21/07/16 20:52:19 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
bye	1
goodbye	1
hadoop	2
hello	2
world	2

 

 

Finally, verify the output files in the HDFS web UI (Utilities > Browse Directory).

 
