Word Count with MapReduce

Posted April 5, 2013 09:56, Filed under: BigData/MapReduce



# Configuring the Java environment in the shell profile


[hadoop@master ~]$ cat ~/.bash_profile
# Get the aliases and functions
if [ -f ~/.bashrc ]; then
        . ~/.bashrc
fi

# User specific environment and startup programs

JAVA_HOME=/home/hadoop/jdk1.7.0_17
export JAVA_HOME

HADOOP_CORE=/home/hadoop/hadoop-1.1.2/hadoop-core-1.1.2.jar
export HADOOP_CORE

export CLASS_PATH=.:$HADOOP_CORE

PATH=$PATH:$HOME/bin:$JAVA_HOME/bin

export PATH
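
Before compiling anything, it can be worth confirming that the JVM on the PATH and the hadoop-core jar are actually visible. The helper below is a hypothetical sketch of my own (the class name EnvCheck and the probe via Class.forName are not part of the original setup):

/* EnvCheck.java - hypothetical verification helper */
public class EnvCheck {
    public static void main(String[] args) throws ClassNotFoundException {
        // Should point at the JDK configured in JAVA_HOME above.
        System.out.println("java.home  = " + System.getProperty("java.home"));
        System.out.println("class.path = " + System.getProperty("java.class.path"));
        // Throws ClassNotFoundException if hadoop-core-1.1.2.jar is not on the classpath.
        Class.forName("org.apache.hadoop.mapred.JobConf");
        System.out.println("hadoop-core classes are visible.");
    }
}

Compile and run it with the profile's variable: javac EnvCheck.java && java -cp $CLASS_PATH EnvCheck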


# Counting words with MapReduce

/* Create the directory to upload into */
[hadoop@master bin]$ ./hadoop fs -mkdir /user/input

/* Create the text files */
[hadoop@master bin]$ echo "Hello World Bye World" > file0

[hadoop@master bin]$ echo "Hello Hadoop Goodbye Hadoop" > file1

/* Upload the files */
[hadoop@master bin]$ ./hadoop fs -put /home/hadoop/file* /user/input

[hadoop@master bin]$ ./hadoop fs -put /home/hadoop/hadoop.txt /user/input

/* List the uploaded files */
[hadoop@master bin]$ ./hadoop fs -ls /user/input
Found 3 items
-rw-r--r--   2 hadoop supergroup         22 2013-04-04 20:55 /user/input/file0
-rw-r--r--   2 hadoop supergroup         28 2013-04-04 20:55 /user/input/file1
-rw-r--r--   2 hadoop supergroup       2463 2013-04-04 20:55 /user/input/hadoop.txt
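
For reference, the same upload and listing can be done from Java with the HDFS FileSystem API. This is a sketch under the assumption that the cluster configuration (core-site.xml etc.) is on the classpath; the class name UploadToHdfs is made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadToHdfs {
    public static void main(String[] args) throws Exception {
        // Picks up fs.default.name from the XML config files on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Equivalent of: hadoop fs -put /home/hadoop/file0 /user/input
        fs.copyFromLocalFile(new Path("/home/hadoop/file0"), new Path("/user/input"));

        // Equivalent of: hadoop fs -ls /user/input
        for (FileStatus status : fs.listStatus(new Path("/user/input"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
    }
}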

/* Compile WordCount.java using the CLASS_PATH set in the profile */
[hadoop@master ~]$ javac -cp $CLASS_PATH -d wordcount /home/hadoop/WordCount.java

/* Package the compiled classes and package tree into a jar file */
[hadoop@master ~]$ jar -cvf wordcount.jar -C wordcount/ .
added manifest
adding: hadoop/(in = 0) (out= 0)(stored 0%)
adding: hadoop/mr/(in = 0) (out= 0)(stored 0%)
adding: hadoop/mr/WordCount$Map.class(in = 1938) (out= 796)(deflated 58%)
adding: hadoop/mr/WordCount.class(in = 1546) (out= 746)(deflated 51%)
adding: hadoop/mr/WordCount$Reduce.class(in = 1611) (out= 648)(deflated 59%)

/*
  # Run the jar against the uploaded files, count the words, and write the result to the output directory

  Map > Shuffle (sort, partition) > Reduce

  The shuffle step groups and sorts the (key, value) pairs produced by the map
  tasks by key. Partitioning then splits the grouped and sorted (key, value)
  pairs by some criterion so that each piece can be handed off to a reduce task
  (see the Partitioner sketch after the job output below).
*/
[hadoop@master bin]$ ./hadoop jar /home/hadoop/wordcount.jar hadoop.mr.WordCount /user/input /user/output
13/04/04 21:20:32 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/04/04 21:20:34 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/04/04 21:20:34 WARN snappy.LoadSnappy: Snappy native library not loaded
13/04/04 21:20:34 INFO mapred.FileInputFormat: Total input paths to process : 3
13/04/04 21:20:35 INFO mapred.JobClient: Running job: job_201304041632_0005
13/04/04 21:20:36 INFO mapred.JobClient:  map 0% reduce 0%
13/04/04 21:21:12 INFO mapred.JobClient:  map 50% reduce 0%
13/04/04 21:21:46 INFO mapred.JobClient:  map 75% reduce 0%
13/04/04 21:21:47 INFO mapred.JobClient:  map 100% reduce 0%
13/04/04 21:21:48 INFO mapred.JobClient:  map 100% reduce 16%
13/04/04 21:21:51 INFO mapred.JobClient:  map 100% reduce 33%
13/04/04 21:21:52 INFO mapred.JobClient:  map 100% reduce 100%
13/04/04 21:21:58 INFO mapred.JobClient: Job complete: job_201304041632_0005
13/04/04 21:21:59 INFO mapred.JobClient: Counters: 30
13/04/04 21:21:59 INFO mapred.JobClient:   Job Counters 
13/04/04 21:21:59 INFO mapred.JobClient:     Launched reduce tasks=1
13/04/04 21:21:59 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=129816
13/04/04 21:21:59 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/04/04 21:21:59 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/04/04 21:21:59 INFO mapred.JobClient:     Launched map tasks=4
13/04/04 21:21:59 INFO mapred.JobClient:     Data-local map tasks=4
13/04/04 21:21:59 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=39428
13/04/04 21:21:59 INFO mapred.JobClient:   File Input Format Counters 
13/04/04 21:21:59 INFO mapred.JobClient:     Bytes Read=3720
13/04/04 21:21:59 INFO mapred.JobClient:   File Output Format Counters 
13/04/04 21:21:59 INFO mapred.JobClient:     Bytes Written=2203
13/04/04 21:21:59 INFO mapred.JobClient:   FileSystemCounters
13/04/04 21:21:59 INFO mapred.JobClient:     FILE_BYTES_READ=3316
13/04/04 21:21:59 INFO mapred.JobClient:     HDFS_BYTES_READ=4122
13/04/04 21:21:59 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=256875
13/04/04 21:21:59 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=2203
13/04/04 21:21:59 INFO mapred.JobClient:   Map-Reduce Framework
13/04/04 21:21:59 INFO mapred.JobClient:     Map output materialized bytes=3334
13/04/04 21:21:59 INFO mapred.JobClient:     Map input records=35
13/04/04 21:21:59 INFO mapred.JobClient:     Reduce shuffle bytes=3334
13/04/04 21:21:59 INFO mapred.JobClient:     Spilled Records=468
13/04/04 21:21:59 INFO mapred.JobClient:     Map output bytes=3810
13/04/04 21:21:59 INFO mapred.JobClient:     CPU time spent (ms)=15660
13/04/04 21:21:59 INFO mapred.JobClient:     Total committed heap usage (bytes)=480002048
13/04/04 21:21:59 INFO mapred.JobClient:     Map input bytes=2513
13/04/04 21:21:59 INFO mapred.JobClient:     SPLIT_RAW_BYTES=402
13/04/04 21:21:59 INFO mapred.JobClient:     Combine input records=339
13/04/04 21:21:59 INFO mapred.JobClient:     Reduce input records=234
13/04/04 21:21:59 INFO mapred.JobClient:     Reduce input groups=213
13/04/04 21:21:59 INFO mapred.JobClient:     Combine output records=234
13/04/04 21:21:59 INFO mapred.JobClient:     Physical memory (bytes) snapshot=602525696
13/04/04 21:21:59 INFO mapred.JobClient:     Reduce output records=213
13/04/04 21:21:59 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1736323072
13/04/04 21:21:59 INFO mapred.JobClient:     Map output records=339
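
As described above, partitioning decides which reduce task receives each key group. Below is a minimal sketch of the default behavior against the old mapred API used in this post (my own illustration, equivalent in spirit to Hadoop's HashPartitioner; the class name WordPartitioner is made up):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class WordPartitioner implements Partitioner<Text, IntWritable> {
    public void configure(JobConf job) {
        // No configuration needed for this sketch.
    }

    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Mask the sign bit so the hash is non-negative, then take the
        // remainder to choose one of the numPartitions reduce tasks.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

It would be wired in with conf.setPartitionerClass(WordPartitioner.class); with a single reducer, as in this job, every key necessarily lands in partition 0.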

/* Check the files created in the output directory */
[hadoop@master bin]$ ./hadoop fs -ls /user/output
Found 3 items
-rw-r--r--   2 hadoop supergroup          0 2013-04-04 21:21 /user/output/_SUCCESS
drwxr-xr-x   - hadoop supergroup          0 2013-04-04 21:20 /user/output/_logs
-rw-r--r--   2 hadoop supergroup       2203 2013-04-04 21:21 /user/output/part-00000

/* View the final result written by the Map > Shuffle > Reduce pipeline */
[hadoop@master bin]$ ./hadoop fs -cat /user/output/part-00000
Bye   1
World 2
Goodbye 1
Hadoop  14
Hello 2
Hive  1
ZooKeeper,   1
Apache  4
cluster 3
clusters 2
collection   1
common  1
computation  1
computation. 1
computers    1
computers,   1
computing.   1
coordination 1
dashboard    1
data  8
distributed  6


# Map/Reduce counters from the run above

Map input records=35
Map output records=339

Combine output records=234
Reduce input records=234
Reduce input groups=213
Reduce output records=213

Reading these together: the 3 input files contained 35 lines, from which the mappers emitted 339 (word, 1) pairs. The combiner collapsed those 339 pairs into 234 partial sums, which became the reducer's input; the reducer grouped them into 213 distinct words and wrote one output record per word. A self-contained walkthrough of this arithmetic on the two small sample files follows.
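
The sketch below (plain Java of my own, no Hadoop required; the class name CounterWalkthrough is made up) mimics the map and combine/reduce steps over file0 and file1 only, so the numbers are 2 input lines, 8 map output pairs, and 5 distinct words rather than the full job's counters:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class CounterWalkthrough {
    public static void main(String[] args) {
        // The contents of file0 and file1 above; each line is one map input record.
        String[] lines = { "Hello World Bye World", "Hello Hadoop Goodbye Hadoop" };

        // Map phase: emit one (word, 1) pair per token.
        List<String> mapOutput = new ArrayList<String>();
        for (String line : lines) {
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                mapOutput.add(tokenizer.nextToken());
            }
        }
        System.out.println("Map input records  = " + lines.length);     // 2
        System.out.println("Map output records = " + mapOutput.size()); // 8

        // Combine/Reduce phase: sum the pairs per distinct word.
        TreeMap<String, Integer> sums = new TreeMap<String, Integer>();
        for (String word : mapOutput) {
            Integer count = sums.get(word);
            sums.put(word, count == null ? 1 : count + 1);
        }
        System.out.println("Reduce output records = " + sums.size());   // 5
        for (Map.Entry<String, Integer> e : sums.entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}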


# WordCount.java source

package hadoop.mr;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCount {

    // Mapper: for each input line, emit a (word, 1) pair per token.
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    // Reducer (also used as the combiner): sum the counts for each word.
    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        // Key/value types of the job output.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        // The reducer doubles as the combiner, which is safe here because
        // summing is associative and commutative.
        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        // Input and output paths come from the command line (/user/input, /user/output above).
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
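
The first job log above warned: "Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same." A minimal sketch of that change, keeping the WordCount class above as-is (WordCountTool is a name I made up for illustration):

package hadoop.mr;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountTool extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        // getConf() already has the generic options (-D key=value, etc.) applied.
        JobConf conf = new JobConf(getConf(), WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(WordCount.Map.class);
        conf.setCombinerClass(WordCount.Reduce.class);
        conf.setReducerClass(WordCount.Reduce.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner strips the generic options before they reach run(),
        // which is what silences the JobClient warning.
        System.exit(ToolRunner.run(new Configuration(), new WordCountTool(), args));
    }
}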



# Testing Hadoop's built-in wordcount example

// Upload hadoop-env.sh from the Hadoop distribution's conf directory to conf on HDFS
[hadoop@master bin]$ ./hadoop fs -put ../conf/hadoop-env.sh conf/hadoop-env.sh
[hadoop@master bin]$ ./hadoop fs -lsr conf
-rw-r--r--   2 hadoop supergroup       2340 2013-05-11 14:09 /user/hadoop/conf/hadoop-env.sh

// Run the wordcount MapReduce job using hadoop-examples.jar
[hadoop@master bin]$ ./hadoop jar ../hadoop-examples-*.jar wordcount conf/hadoop-env.sh wordcount_output
13/05/11 14:10:59 INFO input.FileInputFormat: Total input paths to process : 1
13/05/11 14:10:59 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/05/11 14:10:59 WARN snappy.LoadSnappy: Snappy native library not loaded
13/05/11 14:10:59 INFO mapred.JobClient: Running job: job_201305101027_0001
13/05/11 14:11:00 INFO mapred.JobClient:  map 0% reduce 0%
13/05/11 14:11:04 INFO mapred.JobClient:  map 100% reduce 0%
13/05/11 14:11:12 INFO mapred.JobClient:  map 100% reduce 100%
13/05/11 14:11:12 INFO mapred.JobClient: Job complete: job_201305101027_0001
13/05/11 14:11:12 INFO mapred.JobClient: Counters: 29
13/05/11 14:11:12 INFO mapred.JobClient:   Job Counters 
13/05/11 14:11:12 INFO mapred.JobClient:     Launched reduce tasks=1
13/05/11 14:11:12 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=2611
13/05/11 14:11:12 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/05/11 14:11:12 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/05/11 14:11:12 INFO mapred.JobClient:     Launched map tasks=1
13/05/11 14:11:12 INFO mapred.JobClient:     Data-local map tasks=1
13/05/11 14:11:12 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=7965
13/05/11 14:11:12 INFO mapred.JobClient:   File Output Format Counters 
13/05/11 14:11:12 INFO mapred.JobClient:     Bytes Written=2178
13/05/11 14:11:12 INFO mapred.JobClient:   FileSystemCounters
13/05/11 14:11:12 INFO mapred.JobClient:     FILE_BYTES_READ=2834
13/05/11 14:11:12 INFO mapred.JobClient:     HDFS_BYTES_READ=2463
13/05/11 14:11:12 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=106096
13/05/11 14:11:12 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=2178
13/05/11 14:11:12 INFO mapred.JobClient:   File Input Format Counters 
13/05/11 14:11:12 INFO mapred.JobClient:     Bytes Read=2340
13/05/11 14:11:12 INFO mapred.JobClient:   Map-Reduce Framework
13/05/11 14:11:12 INFO mapred.JobClient:     Map output materialized bytes=2834
13/05/11 14:11:12 INFO mapred.JobClient:     Map input records=58
13/05/11 14:11:12 INFO mapred.JobClient:     Reduce shuffle bytes=2834
13/05/11 14:11:12 INFO mapred.JobClient:     Spilled Records=326
13/05/11 14:11:12 INFO mapred.JobClient:     Map output bytes=3381
13/05/11 14:11:12 INFO mapred.JobClient:     Total committed heap usage (bytes)=223805440
13/05/11 14:11:12 INFO mapred.JobClient:     CPU time spent (ms)=990
13/05/11 14:11:12 INFO mapred.JobClient:     Combine input records=267
13/05/11 14:11:12 INFO mapred.JobClient:     SPLIT_RAW_BYTES=123
13/05/11 14:11:12 INFO mapred.JobClient:     Reduce input records=163
13/05/11 14:11:12 INFO mapred.JobClient:     Reduce input groups=163
13/05/11 14:11:12 INFO mapred.JobClient:     Combine output records=163
13/05/11 14:11:12 INFO mapred.JobClient:     Physical memory (bytes) snapshot=247476224
13/05/11 14:11:12 INFO mapred.JobClient:     Reduce output records=163
13/05/11 14:11:12 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=803270656
13/05/11 14:11:12 INFO mapred.JobClient:     Map output records=267

// View the word counts produced by the job
[hadoop@master bin]$ ./hadoop fs -cat wordcount_output/part-r-00000
#       34
$HADOOP_BALANCER_OPTS"  1
$HADOOP_DATANODE_OPTS"  1
$HADOOP_HOME/conf/slaves        1
$HADOOP_HOME/logs       1
$HADOOP_JOBTRACKER_OPTS"        1
$HADOOP_NAMENODE_OPTS"  1
$HADOOP_SECONDARYNAMENODE_OPTS" 1
$USER   1
'man    1
(fs,    1
-o      1
/tmp    1
1000.   1
A       1
All     1
CLASSPATH       1
Command 1
ConnectTimeout=1        1
Default 1
Empty   2
Extra   3
File    1
HADOOP_BALANCER_OPTS="-Dcom.sun.management.jmxremote    1
HADOOP_CLASSPATH=       1
HADOOP_CLIENT_OPTS      1
HADOOP_DATANODE_OPTS="-Dcom.sun.management.jmxremote    1
HADOOP_HEAPSIZE=2000    1
HADOOP_HOME=/home/hadoop/hadoop-1.1.2   1
HADOOP_HOME_WARN_SUPPRESS="TRUE"        1
HADOOP_IDENT_STRING=$USER       1
HADOOP_JOBTRACKER_OPTS="-Dcom.sun.management.jmxremote  1
HADOOP_LOG_DIR=${HADOOP_HOME}/logs      1
HADOOP_MASTER=master:/home/$USER/src/hadoop     1
HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote    1
HADOOP_NICENESS=10      1
HADOOP_OPTS     1
HADOOP_OPTS=-server     1
HADOOP_PID_DIR=/var/hadoop/pids 1
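
The same output can also be read back from Java. Another sketch of my own, with the same classpath assumptions as before (ReadOutput is a made-up name):

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadOutput {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Equivalent of: hadoop fs -cat wordcount_output/part-r-00000
        // (a relative path resolves under /user/<user>, here /user/hadoop).
        Path part = new Path("wordcount_output/part-r-00000");
        BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(part)));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // each line is "word<TAB>count"
            }
        } finally {
            reader.close();
        }
    }
}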


※ The content above was compiled from various references along with my own interpretation.
   If you find incorrect information or anything that needs improvement, a comment or an email would be much appreciated.

