HDFS - Uploading Files with a Modified DFS Block Size

Posted 05 18, 2013 16:37, Filed under: BigData/Hadoop

# Changing the DFS Block Size

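The performance of a MapReduce job depends on how many map tasks and reduce tasks it runs.
When a file is uploaded to HDFS it is split into 64MB blocks, and the number of map tasks is derived from the resulting block count.
The 64MB block size is controlled by the dfs.block.size property in hdfs-site.xml, the Hadoop configuration file; if the property is not set, the default of 64MB is used. (Hadoop 2.x and later name this property dfs.blocksize and default to 128MB.)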

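For a file of a given size, splitting it into more blocks means more map tasks, and when the cluster has free map slots to run them in parallel the job can finish sooner.
A block size smaller than 64MB therefore produces more blocks and correspondingly more map tasks. Each task also carries startup overhead, so blocks that are too small can slow the job down instead.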
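
The block arithmetic above can be sketched in shell. This is only an illustration, assuming one map task per block (true for splittable, uncompressed input with the default input format). The helper name block_count is made up for this example, and the file size is the one reported for 1987.csv in the data_32mb listing in this post.

```shell
#!/bin/bash
# block_count SIZE BLOCK_SIZE
# Hypothetical helper: number of HDFS blocks needed to store SIZE bytes,
# computed as an integer ceiling division.
block_count() {
    local size=$1 block_size=$2
    echo $(( (size + block_size - 1) / block_size ))
}

mb=$((1024 * 1024))

# 1987.csv is 127162942 bytes.
echo "1987.csv @64MB: $(block_count 127162942 $((64 * mb))) blocks"   # prints 2 blocks
echo "1987.csv @32MB: $(block_count 127162942 $((32 * mb))) blocks"   # prints 4 blocks
```

Halving the block size roughly doubles the block count, and with it the number of map tasks a job over that file would launch.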

Setting the value in hdfs-site.xml applies that block size to every file uploaded to HDFS.
To change the block size of specific files only, run the distcp command provided by the hadoop script as follows.

./bin/hadoop distcp -Ddfs.block.size=[HDFS block size] [input path (a file already in HDFS, not on the local file system)] [output path]


distcp is normally used to copy files, but passing the dfs.block.size option lets you change the block size of the copies it writes.
The HDFS block size given to "-Ddfs.block.size" must be specified in bytes.

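Because the value has to be in bytes, it can be easier to compute it than to type it out. A small sketch; mb_to_bytes is a hypothetical helper name for this example, not something Hadoop provides.

```shell
#!/bin/bash
# mb_to_bytes MB
# Hypothetical helper: convert a size in MB (1024 * 1024 bytes) to bytes,
# the unit that -Ddfs.block.size expects.
mb_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

for mb in 32 64 128; do
    echo "${mb}MB = $(mb_to_bytes "$mb") bytes"
done
# prints:
# 32MB = 33554432 bytes
# 64MB = 67108864 bytes
# 128MB = 134217728 bytes
```

The result can be substituted directly into the command, for example -Ddfs.block.size=$(mb_to_bytes 32).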

[hadoop@master bin]$ ./hadoop fs -mkdir data_32mb
[hadoop@master bin]$ ./hadoop fs -ls
Found 12 items
drwxr-xr-x   - hadoop supergroup          0 2013-05-18 16:19 /user/hadoop/.Trash
drwxr-xr-x   - hadoop supergroup          0 2013-05-15 10:55 /user/hadoop/data
drwxr-xr-x   - hadoop supergroup          0 2013-05-18 16:21 /user/hadoop/data_32mb

make_32mb_files.sh
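#!/bin/bash
# Copy each year's CSV into data_32mb with a 32MB block size.

block_size=$((32 * 1024 * 1024))   # dfs.block.size must be given in bytes

for year in $(seq 1987 2008)
do
   ./bin/hadoop distcp -Ddfs.block.size="$block_size" "/user/hadoop/data/$year.csv" "/user/hadoop/data_32mb/$year.csv"
done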

# Run the file copy
[hadoop@master hadoop]$ ./make_32mb_files.sh
13/05/18 16:26:08 INFO tools.DistCp: srcPaths=[/user/hadoop/data/1987.csv]
13/05/18 16:26:08 INFO tools.DistCp: destPath=/user/hadoop/data_32mb/1987.csv
13/05/18 16:26:09 INFO tools.DistCp: sourcePathsCount=1
13/05/18 16:26:09 INFO tools.DistCp: filesToCopyCount=1
13/05/18 16:26:09 INFO tools.DistCp: bytesToCopyCount=121.3m

13/05/18 16:26:09 INFO mapred.JobClient: Running job: job_201305181613_0025
13/05/18 16:26:10 INFO mapred.JobClient:  map 0% reduce 0%
13/05/18 16:26:15 INFO mapred.JobClient:  map 100% reduce 0%
13/05/18 16:26:15 INFO mapred.JobClient: Job complete: job_201305181613_0025
13/05/18 16:26:15 INFO mapred.JobClient: Counters: 22
13/05/18 16:26:15 INFO mapred.JobClient:   Job Counters
13/05/18 16:26:15 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=4847
13/05/18 16:26:15 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/05/18 16:26:15 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/05/18 16:26:15 INFO mapred.JobClient:     Launched map tasks=1
13/05/18 16:26:15 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
13/05/18 16:26:15 INFO mapred.JobClient:   File Input Format Counters
13/05/18 16:26:15 INFO mapred.JobClient:     Bytes Read=234
13/05/18 16:26:15 INFO mapred.JobClient:   File Output Format Counters
13/05/18 16:26:15 INFO mapred.JobClient:     Bytes Written=0
13/05/18 16:26:15 INFO mapred.JobClient:   FileSystemCounters
13/05/18 16:26:15 INFO mapred.JobClient:     HDFS_BYTES_READ=127163348
13/05/18 16:26:15 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=51951
13/05/18 16:26:15 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=127162942
13/05/18 16:26:15 INFO mapred.JobClient:   distcp
13/05/18 16:26:15 INFO mapred.JobClient:     Files copied=1
13/05/18 16:26:15 INFO mapred.JobClient:     Bytes copied=127162942
13/05/18 16:26:15 INFO mapred.JobClient:     Bytes expected=127162942
13/05/18 16:26:15 INFO mapred.JobClient:   Map-Reduce Framework
13/05/18 16:26:15 INFO mapred.JobClient:     Map input records=1
13/05/18 16:26:15 INFO mapred.JobClient:     Physical memory (bytes) snapshot=105881600
13/05/18 16:26:15 INFO mapred.JobClient:     Spilled Records=0
13/05/18 16:26:15 INFO mapred.JobClient:     CPU time spent (ms)=2230
13/05/18 16:26:15 INFO mapred.JobClient:     Total committed heap usage (bytes)=116260864
13/05/18 16:26:15 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=744169472
13/05/18 16:26:15 INFO mapred.JobClient:     Map input bytes=134
13/05/18 16:26:15 INFO mapred.JobClient:     Map output records=0
13/05/18 16:26:15 INFO mapred.JobClient:     SPLIT_RAW_BYTES=172
13/05/18 16:26:15 INFO tools.DistCp: srcPaths=[/user/hadoop/data/1988.csv]
13/05/18 16:26:15 INFO tools.DistCp: destPath=/user/hadoop/data_32mb/1988.csv
13/05/18 16:26:16 INFO tools.DistCp: sourcePathsCount=1
13/05/18 16:26:16 INFO tools.DistCp: filesToCopyCount=1
13/05/18 16:26:16 INFO tools.DistCp: bytesToCopyCount=477.8m
13/05/18 16:26:16 INFO mapred.JobClient: Running job: job_201305181613_0026
^C13/05/18 16:26:17 INFO tools.DistCp: srcPaths=[/user/hadoop/data/1989.csv]
13/05/18 16:26:17 INFO tools.DistCp: destPath=/user/hadoop/data_32mb/1989.csv
13/05/18 16:26:18 INFO tools.DistCp: sourcePathsCount=1
13/05/18 16:26:18 INFO tools.DistCp: filesToCopyCount=1
13/05/18 16:26:18 INFO tools.DistCp: bytesToCopyCount=464.0m
13/05/18 16:26:18 INFO mapred.JobClient: Running job: job_201305181613_0027
13/05/18 16:26:19 INFO mapred.JobClient:  map 0% reduce 0%
^C13/05/18 16:26:20 INFO tools.DistCp: srcPaths=[/user/hadoop/data/1990.csv]
13/05/18 16:26:20 INFO tools.DistCp: destPath=/user/hadoop/data_32mb/1990.csv
^C13/05/18 16:26:21 INFO tools.DistCp: srcPaths=[/user/hadoop/data/1991.csv]
13/05/18 16:26:21 INFO tools.DistCp: destPath=/user/hadoop/data_32mb/1991.csv
13/05/18 16:26:21 INFO tools.DistCp: sourcePathsCount=1
13/05/18 16:26:21 INFO tools.DistCp: filesToCopyCount=1
13/05/18 16:26:21 INFO tools.DistCp: bytesToCopyCount=468.5m
13/05/18 16:29:59 INFO mapred.JobClient: Running job: job_201305181613_0045
13/05/18 16:30:00 INFO mapred.JobClient:  map 0% reduce 0%

13/05/18 16:30:09 INFO mapred.JobClient:  map 100% reduce 0%
13/05/18 16:30:09 INFO mapred.JobClient: Job complete: job_201305181613_0045
13/05/18 16:30:09 INFO mapred.JobClient: Counters: 22
13/05/18 16:30:09 INFO mapred.JobClient:   Job Counters
13/05/18 16:30:09 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=8912
13/05/18 16:30:09 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/05/18 16:30:09 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/05/18 16:30:09 INFO mapred.JobClient:     Launched map tasks=1
13/05/18 16:30:09 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
13/05/18 16:30:09 INFO mapred.JobClient:   File Input Format Counters
13/05/18 16:30:09 INFO mapred.JobClient:     Bytes Read=254
13/05/18 16:30:09 INFO mapred.JobClient:   File Output Format Counters
13/05/18 16:30:09 INFO mapred.JobClient:     Bytes Written=0
13/05/18 16:30:09 INFO mapred.JobClient:   FileSystemCounters
13/05/18 16:30:09 INFO mapred.JobClient:     HDFS_BYTES_READ=689413770
13/05/18 16:30:09 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=51951
13/05/18 16:30:09 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=689413344
13/05/18 16:30:09 INFO mapred.JobClient:   distcp
13/05/18 16:30:09 INFO mapred.JobClient:     Files copied=1
13/05/18 16:30:09 INFO mapred.JobClient:     Bytes copied=689413344
13/05/18 16:30:09 INFO mapred.JobClient:     Bytes expected=689413344
13/05/18 16:30:09 INFO mapred.JobClient:   Map-Reduce Framework
13/05/18 16:30:09 INFO mapred.JobClient:     Map input records=1
13/05/18 16:30:09 INFO mapred.JobClient:     Physical memory (bytes) snapshot=185475072
13/05/18 16:30:09 INFO mapred.JobClient:     Spilled Records=0
13/05/18 16:30:09 INFO mapred.JobClient:     CPU time spent (ms)=8370
13/05/18 16:30:09 INFO mapred.JobClient:     Total committed heap usage (bytes)=157548544
13/05/18 16:30:09 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=742912000
13/05/18 16:30:09 INFO mapred.JobClient:     Map input bytes=154
13/05/18 16:30:09 INFO mapred.JobClient:     Map output records=0
13/05/18 16:30:09 INFO mapred.JobClient:     SPLIT_RAW_BYTES=172

# List the files
[hadoop@master bin]$ ./hadoop fs -ls data_32mb
Found 67 items
-rw-r--r--   1 hadoop supergroup  127162942 2013-05-18 16:26 /user/hadoop/data_32mb/1987.csv
-rw-r--r--   1 hadoop supergroup  501039472 2013-05-18 16:26 /user/hadoop/data_32mb/1988.csv
-rw-r--r--   1 hadoop supergroup  486518821 2013-05-18 16:26 /user/hadoop/data_32mb/1989.csv
-rw-r--r--   1 hadoop supergroup  509194687 2013-05-18 16:21 /user/hadoop/data_32mb/1990.csv
-rw-r--r--   1 hadoop supergroup  491210093 2013-05-18 16:26 /user/hadoop/data_32mb/1991.csv
-rw-r--r--   1 hadoop supergroup  492313731 2013-05-18 16:26 /user/hadoop/data_32mb/1992.csv
-rw-r--r--   1 hadoop supergroup  490753652 2013-05-18 16:26 /user/hadoop/data_32mb/1993.csv
-rw-r--r--   1 hadoop supergroup  501558665 2013-05-18 16:27 /user/hadoop/data_32mb/1994.csv
-rw-r--r--   1 hadoop supergroup  530751568 2013-05-18 16:27 /user/hadoop/data_32mb/1995.csv
-rw-r--r--   1 hadoop supergroup  533922363 2013-05-18 16:27 /user/hadoop/data_32mb/1996.csv
-rw-r--r--   1 hadoop supergroup  540347861 2013-05-18 16:27 /user/hadoop/data_32mb/1997.csv
-rw-r--r--   1 hadoop supergroup  538432875 2013-05-18 16:27 /user/hadoop/data_32mb/1998.csv
-rw-r--r--   1 hadoop supergroup  552926022 2013-05-18 16:28 /user/hadoop/data_32mb/1999.csv
-rw-r--r--   1 hadoop supergroup  570151613 2013-05-18 16:28 /user/hadoop/data_32mb/2000.csv
-rw-r--r--   1 hadoop supergroup  600411462 2013-05-18 16:28 /user/hadoop/data_32mb/2001.csv
-rw-r--r--   1 hadoop supergroup  530507013 2013-05-18 16:28 /user/hadoop/data_32mb/2002.csv
-rw-r--r--   1 hadoop supergroup  626745242 2013-05-18 16:28 /user/hadoop/data_32mb/2003.csv
-rw-r--r--   1 hadoop supergroup  669879113 2013-05-18 16:29 /user/hadoop/data_32mb/2004.csv
-rw-r--r--   1 hadoop supergroup  671027265 2013-05-18 16:29 /user/hadoop/data_32mb/2005.csv
-rw-r--r--   1 hadoop supergroup  672068096 2013-05-18 16:29 /user/hadoop/data_32mb/2006.csv
-rw-r--r--   1 hadoop supergroup  702878193 2013-05-18 16:29 /user/hadoop/data_32mb/2007.csv
-rw-r--r--   1 hadoop supergroup  689413344 2013-05-18 16:30 /user/hadoop/data_32mb/2008.csv
drwxr-xr-x   - hadoop supergroup          0 2013-05-18 16:22 /user/hadoop/data_32mb/_distcp_logs_1sqd7l
drwxr-xr-x   - hadoop supergroup          0 2013-05-18 16:27 /user/hadoop/data_32mb/_distcp_logs_25pfag
drwxr-xr-x   - hadoop supergroup          0 2013-05-18 16:23 /user/hadoop/data_32mb/_distcp_logs_2ccjcs
drwxr-xr-x   - hadoop supergroup          0 2013-05-18 16:28 /user/hadoop/data_32mb/_distcp_logs_2e33mb
drwxr-xr-x   - hadoop supergroup          0 2013-05-18 16:25 /user/hadoop/data_32mb/_distcp_logs_3kh0ma
drwxr-xr-x   - hadoop supergroup          0 2013-05-18 16:22 /user/hadoop/data_32mb/_distcp_logs_45191u
drwxr-xr-x   - hadoop supergroup          0 2013-05-18 16:23 /user/hadoop/data_32mb/_distcp_logs_4ei2q8
drwxr-xr-x   - hadoop supergroup          0 2013-05-18 16:21 /user/hadoop/data_32mb/_distcp_logs_74x555
drwxr-xr-x   - hadoop supergroup          0 2013-05-18 16:26 /user/hadoop/data_32mb/_distcp_logs_7aitid
drwxr-xr-x   - hadoop supergroup          0 2013-05-18 16:22 /user/hadoop/data_32mb/_distcp_logs_8i4eop
drwxr-xr-x   - hadoop supergroup          0 2013-05-18 16:24 /user/hadoop/data_32mb/_distcp_logs_9eleh1
drwxr-xr-x   - hadoop supergroup          0 2013-05-18 16:28 /user/hadoop/data_32mb/_distcp_logs_b8sgs8
drwxr-xr-x   - hadoop supergroup          0 2013-05-18 16:28 /user/hadoop/data_32mb/_distcp_logs_bfyj3j
drwxr-xr-x   - hadoop supergroup          0 2013-05-18 16:26 /user/hadoop/data_32mb/_distcp_logs_c8fie
drwxr-xr-x   - hadoop supergroup          0 2013-05-18 16:23 /user/hadoop/data_32mb/_distcp_logs_cg5gbm
drwxr-xr-x   - hadoop supergroup          0 2013-05-18 16:29 /user/hadoop/data_32mb/_distcp_logs_df0r7y
drwxr-xr-x   - hadoop supergroup          0 2013-05-18 16:23 /user/hadoop/data_32mb/_distcp_logs_e8h0nk
drwxr-xr-x   - hadoop supergroup          0 2013-05-18 16:29 /user/hadoop/data_32mb/_distcp_logs_ee22te
drwxr-xr-x   - hadoop supergroup          0 2013-05-18 16:29 /user/hadoop/data_32mb/_distcp_logs_fbq4yb
drwxr-xr-x   - hadoop supergroup          0 2013-05-18 16:21 /user/hadoop/data_32mb/_distcp_logs_grv1cz
drwxr-xr-x   - hadoop supergroup          0 2013-05-18 16:26 /user/hadoop/data_32mb/_distcp_logs_h0y3fh
drwxr-xr-x   - hadoop supergroup          0 2013-05-18 16:29 /user/hadoop/data_32mb/_distcp_logs_h500ct
drwxr-xr-x   - hadoop supergroup          0 2013-05-18 16:29 /user/hadoop/data_32mb/_distcp_logs_ilpe24
drwxr-xr-x   - hadoop supergroup          0 2013-05-18 16:28 /user/hadoop/data_32mb/_distcp_logs_kaaajw
drwxr-xr-x   - hadoop supergroup          0 2013-05-18 16:22 /user/hadoop/data_32mb/_distcp_logs_kd6h1l
drwxr-xr-x   - hadoop supergroup          0 2013-05-18 16:25 /user/hadoop/data_32mb/_distcp_logs_kz6it
drwxr-xr-x   - hadoop supergroup          0 2013-05-18 16:25 /user/hadoop/data_32mb/_distcp_logs_lubslr
drwxr-xr-x   - hadoop supergroup          0 2013-05-18 16:22 /user/hadoop/data_32mb/_distcp_logs_o517l
drwxr-xr-x   - hadoop supergroup          0 2013-05-18 16:21 /user/hadoop/data_32mb/_distcp_logs_o9j2lv
drwxr-xr-x   - hadoop supergroup          0 2013-05-18 16:30 /user/hadoop/data_32mb/_distcp_logs_phx0ja
drwxr-xr-x   - hadoop supergroup          0 2013-05-18 16:27 /user/hadoop/data_32mb/_distcp_logs_r6apwh
drwxr-xr-x   - hadoop supergroup          0 2013-05-18 16:26 /user/hadoop/data_32mb/_distcp_logs_sbc0yu
drwxr-xr-x   - hadoop supergroup          0 2013-05-18 16:27 /user/hadoop/data_32mb/_distcp_logs_sufgv6
drwxr-xr-x   - hadoop supergroup          0 2013-05-18 16:24 /user/hadoop/data_32mb/_distcp_logs_tgs90k
drwxr-xr-x   - hadoop supergroup          0 2013-05-18 16:23 /user/hadoop/data_32mb/_distcp_logs_ubx3vk
drwxr-xr-x   - hadoop supergroup          0 2013-05-18 16:24 /user/hadoop/data_32mb/_distcp_logs_uqpgu9
drwxr-xr-x   - hadoop supergroup          0 2013-05-18 16:25 /user/hadoop/data_32mb/_distcp_logs_uzgm3f
drwxr-xr-x   - hadoop supergroup          0 2013-05-18 16:24 /user/hadoop/data_32mb/_distcp_logs_v08vfb
drwxr-xr-x   - hadoop supergroup          0 2013-05-18 16:26 /user/hadoop/data_32mb/_distcp_logs_v6jzdf
drwxr-xr-x   - hadoop supergroup          0 2013-05-18 16:27 /user/hadoop/data_32mb/_distcp_logs_vd4q7z
drwxr-xr-x   - hadoop supergroup          0 2013-05-18 16:24 /user/hadoop/data_32mb/_distcp_logs_vw5iho
drwxr-xr-x   - hadoop supergroup          0 2013-05-18 16:26 /user/hadoop/data_32mb/_distcp_logs_w04ak0
drwxr-xr-x   - hadoop supergroup          0 2013-05-18 16:27 /user/hadoop/data_32mb/_distcp_logs_wqu49c
drwxr-xr-x   - hadoop supergroup          0 2013-05-18 16:26 /user/hadoop/data_32mb/_distcp_tmp_7aitid
drwxr-xr-x   - hadoop supergroup          0 2013-05-18 16:26 /user/hadoop/data_32mb/_distcp_tmp_c8fie
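
The listing shows file sizes but not block sizes. One way to confirm that a copy really uses 32MB blocks is to ask HDFS for the block size of a copied file. The following is a hedged sketch, assuming the fs -stat command in your Hadoop version accepts the %o (block size) format; verify_block_size and HADOOP_BIN are names made up for this example.

```shell
#!/bin/bash
# Path to the hadoop launcher script. This default is an assumption;
# override HADOOP_BIN to match your installation.
HADOOP_BIN=${HADOOP_BIN:-./bin/hadoop}

# verify_block_size PATH EXPECTED_BYTES
# Hypothetical helper: succeeds when the block size HDFS reports for PATH
# equals EXPECTED_BYTES, returns 2 if the stat command itself fails.
verify_block_size() {
    local path=$1 expected=$2 actual
    actual=$("$HADOOP_BIN" fs -stat %o "$path") || return 2
    [ "$actual" -eq "$expected" ]
}

# Example (needs a running cluster, so it is left commented out):
# verify_block_size /user/hadoop/data_32mb/1987.csv $((32 * 1024 * 1024)) \
#     && echo "1987.csv uses 32MB blocks"
```

If you also want to see the individual blocks, hadoop fsck with the -files and -blocks options lists them for a given path.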


Reference: 시작하세요! 하둡 프로그래밍


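※ The content above is compiled from various references and my own subjective notes.
   If you find incorrect information or anything that needs improvement, a comment or an email would be a great help.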