# NOTE: the header of this first loop is truncated in the source; a glob over
# the Hadoop core jars, such as the one below, is assumed.
for f in $HADOOP_HOME/hadoop-*.jar; do
  CLASSPATH=${CLASSPATH}:$f
done
# add libs to CLASSPATH
for f in $HADOOP_HOME/lib/*.jar; do
  CLASSPATH=${CLASSPATH}:$f
done
for f in $HADOOP_HOME/lib/jetty-ext/*.jar; do
  CLASSPATH=${CLASSPATH}:$f
done
LIBJVM=`find -L $JAVA_HOME -wholename '*/server/libjvm.so' -print | tail -1`
if [ -z "${LIBJVM}" ]; then
  echo "Unable to find libjvm.so in JAVA_HOME $JAVA_HOME" 1>&2
  exit 1
fi
# prefer the libhdfs in build
LIBHDFS=`find $PWD/libhdfs $PWD/build -iname libhdfs.so -print | tail -1`
if [ -z "${LIBHDFS}" ]; then
  echo "Unable to find libhdfs.so in libhdfs or build" 1>&2
fi
if [ -z "${LD_LIBRARY_PATH}" ]; then
  LD_LIBRARY_PATH="`dirname "${LIBJVM}"`:`dirname "${LIBHDFS}"`"
else
  LD_LIBRARY_PATH="`dirname "${LIBJVM}"`:`dirname "${LIBHDFS}"`":"${LD_LIBRARY_PATH}"
fi
echo "export CLASSPATH='${CLASSPATH}'"
echo "export LD_LIBRARY_PATH='${LD_LIBRARY_PATH}'"

The script writes export statements for CLASSPATH and LD_LIBRARY_PATH to its standard output, so a calling shell can apply them (for example, by eval-ing the script's output). After the runtime environment is correctly configured, the fuse_dfs program can be run by using the command-line arguments shown in Table 8-5.

Table 8-5. fuse_dfs Command-Line Arguments

Argument            Default          Suggested Value      Description
server              None (required)  NameNode hostname    The server to connect to for HDFS service.
port                None (required)  NameNode port        The port that the NameNode listens on for HDFS requests.
entry_timeout       60               -                    The cache timeout for names.
attribute_timeout   60               -                    The cache timeout for attributes.
protected           None             /user:/tmp           The list of exact paths that fuse_dfs will not delete or move.
rdbuffer            10485760         10485760             The size of the buffer used for reading from HDFS.
private             None             None                 Allows only the user running fuse_dfs to access the file system.
ro                  N/A              N/A                  Mounts the file system read-only.
rw                  N/A              N/A                  Mounts the file system read-write.
debug               N/A              N/A                  Enables debugging messages and runs in the foreground.
initchecks          N/A              N/A                  Performs environment checks and logs the results on startup.
nopermissions       enabled          enabled              Does not do permission checking; permission checking is not supported as of Hadoop 0.19.
big_writes          None             enabled              Configures fuse_dfs to use large writes.
usetrash            enabled          enabled              Uses the trash directory when deleting files.
notrash             disabled         disabled             Does not use the trash directory when deleting files. Does not work in Hadoop 0.19.

The arguments in Table 8-6 are passed to the underlying FUSE implementation rather than being handled directly by fuse_dfs.

Table 8-6. Selected FUSE Command-Line Arguments

Argument          Default   Suggested Value   Description
allow_other       None      enabled           Allows access by other users.
allow_root        None      disabled          Allows only root access.
nonempty          None      disabled          Allows mounting over a non-empty file or directory.
fsname            None      None              Sets the file system name for /etc/mtab.
subtype           None      None              Sets the file system type for /etc/mtab.
direct_io         None      None              Uses direct I/O instead of buffered I/O.
kernel_cache      None      None              Caches files in the kernel.
[no]auto_cache    None      None              Enables caching based on modification times (off by default).

The following command will mount a read-only HDFS file system with debugging on. The fs.default.name for the file system being mounted is hdfs://cloud9:8020. The mount point for the file system is /mnt/hdfs, and the arguments after /mnt/hdfs are passed to the FUSE subsystem. These are reasonable arguments for mounting an HDFS file system:

./fuse_dfs -oserver=cloud9 -oport=8020 -oro -oinitchecks -oallow_other /mnt/hdfs \
    -o fsname="HDFS" -o debug

It is possible to set up a Linux system so that an HDFS file system is mounted at system start time, by updating the system /etc/fstab file with a mount request for an HDFS file system. To set up system-managed mounts via /etc/fstab, a script /bin/fuse_dfs must be created that sets up the environment and then passes the command-line arguments to the actual fuse_dfs program. This script simply sets up the CLASSPATH and LD_LIBRARY_PATH environment variables, as the script in Listing 8-4 does.

Listing 8-5 shows a candidate line for /etc/fstab that mounts an HDFS file system at system initialization time. The mount script for the /etc/fstab entry in Listing 8-5 is passed four arguments. To actually perform the mount, the following could be added to the script: ${HADOOP_HOME}/build/contrib/fuse-dfs/fuse_dfs "$3" "$4" "$1" "$2". The script could then be placed in /bin (see the script bin_fuse_dfs in the examples); a minimal sketch of such a wrapper appears after Listing 8-5.

Listing 8-5. A Candidate Mount Line for /etc/fstab to Mount an HDFS File System

fuse_dfs#dfs://at:9020 /mnt/hdfs fuse rw,usetrash,allow_other,initchecks 0 0
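The book's bin_fuse_dfs script is not reproduced here, so the following is only a minimal sketch of such a wrapper, assuming HADOOP_HOME and JAVA_HOME locations and assuming the environment script from Listing 8-4 has been installed as /usr/local/bin/fuse_dfs_env.sh; all three paths are illustrative, not taken from the book.

#!/bin/sh
# A sketch of /bin/fuse_dfs, the wrapper invoked for the /etc/fstab entry in
# Listing 8-5. The install locations below are assumptions.

# HADOOP_HOME and JAVA_HOME must be set for the environment script to work.
HADOOP_HOME=${HADOOP_HOME:-/opt/hadoop}     # assumed install location
JAVA_HOME=${JAVA_HOME:-/usr/java/default}   # assumed JVM location
export HADOOP_HOME JAVA_HOME

# The Listing 8-4 script searches $PWD for libhdfs.so, so run it from the
# fuse-dfs build tree and evaluate the "export ..." lines it prints.
cd ${HADOOP_HOME}/build/contrib/fuse-dfs || exit 1
eval "`/usr/local/bin/fuse_dfs_env.sh`"     # assumed name for the Listing 8-4 script

# mount(8) invokes this wrapper with four arguments; pass them to the real
# fuse_dfs program in the order the text describes: "$3" "$4" "$1" "$2".
exec ${HADOOP_HOME}/build/contrib/fuse-dfs/fuse_dfs "$3" "$4" "$1" "$2"

With such a wrapper installed as /bin/fuse_dfs, the entry in Listing 8-5 should allow either system startup or a manual mount /mnt/hdfs to perform the mount.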
Alternate MapReduce Techniques

The traditional MapReduce job reads a set of input data, performs some transformations in the map phase, sorts the results, performs another transformation in the reduce phase, and writes a set of output data. The sorting stage requires data to be transferred across the network and also incurs the computational expense of sorting. In addition, the input data is read from, and the output data is written to, HDFS. The overhead involved in passing data between HDFS and the map phase, the overhead involved in moving the data during the sort stage, and the writing of data to HDFS at the end of the job all result in application design patterns that have large, complex map methods and potentially complex reduce methods, so as to minimize the number of times the data is passed through the cluster.

Many processes require multiple steps, some of which require a reduce phase, leaving at least one input to the next job step already sorted. Having to re-sort this data may use significant cluster resources. The following section goes into detail about a variety of techniques that are helpful for special situations.