
Notes on How MapReduce Works

To create the list of tasks to run, the job scheduler first retrieves the input splits computed by the JobClient from the shared filesystem. It then creates one map task for each split.

In fact, when the input data is small enough, the job runs with the number of mappers you configured.
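
A sketch with the old mapred API; note that setNumMapTasks() is only a hint, since the scheduler still creates one map task per input split:

    import org.apache.hadoop.mapred.JobConf;

    public class MapCountHint {
        static void hintMapCount(JobConf conf) {
            // Only a hint: the actual number of map tasks is driven by the splits.
            conf.setNumMapTasks(4);
        }
    }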

Tasktrackers have a fixed number of slots for map tasks and for reduce tasks: for example, a tasktracker may be able to run two map tasks and two reduce tasks simultaneously. (The precise number depends on the number of cores and the amount of memory on the tasktracker; see “Memory” on page 254.) The default scheduler fills empty map task slots before reduce task slots, so if the tasktracker has at least one empty map task slot, the jobtracker will select a map task; otherwise, it will select a reduce task.

To choose a reduce task the jobtracker simply takes the next in its list of yet-to-be-run reduce tasks, since there are no data locality considerations.

Reducers don't take data locality into account.

For a map task, however, it takes account of the tasktracker’s network location and picks a task whose input split is as close as possible to the tasktracker. In the optimal case, the task is data-local, that is, running on the same node that the split resides on. Alternatively, the task may be rack-local: on the same rack, but not the same node, as the split.

Mappers do take data locality into account.

TaskRunner launches a new Java Virtual Machine (step 9) to run each task in (step 10), so that any bugs in the user-defined map and reduce functions don’t affect the tasktracker (by causing it to crash or hang, for example). It is however possible to reuse the JVM between tasks; see “Task JVM Reuse” on page 170.

Both map and reduce tasks run inside a JVM.

When a task is running, it keeps track of its progress, that is, the proportion of the task completed. For map tasks, this is the proportion of the input that has been processed. For reduce tasks, it’s a little more complex, but the system can still estimate the proportion of the reduce input processed. It does this by dividing the total progress into three parts, corresponding to the three phases of the shuffle (see “Shuffle and Sort” on page 163). For example, if the task has run the reducer on half its input, then the task’s progress is ⅚, since it has completed the copy and sort phases (⅓ each) and is half way through the reduce phase (⅙).

Do the shuffle and sort phases really account for that large a share of the progress?
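
The arithmetic as a runnable sketch (the 1/3 weighting per phase comes straight from the passage above):

    // Reduce-task progress: the copy, sort, and reduce phases each weigh 1/3.
    public class ReduceProgress {
        static double progress(double copyDone, double sortDone, double reduceDone) {
            return (copyDone + sortDone + reduceDone) / 3.0;
        }
        public static void main(String[] args) {
            // Copy and sort finished, reducer halfway through its input:
            System.out.println(progress(1.0, 1.0, 0.5)); // 0.8333... = 5/6
        }
    }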

With Streaming, when the work goes through a pipe there is no way to report progress updates.

Setting the timeout to a value of zero disables the timeout, so long-running tasks are never marked as failed. In this case, a hanging task will never free up its slot, and over time there may be cluster slowdown as a result. This approach should therefore be avoided, and making sure that a task is reporting progress periodically will suffice (see “What Constitutes Progress in MapReduce?” on page 158).

Setting mapred.task.timeout=0 still carries a lot of risk.
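
The safer alternative is to keep reporting progress from long-running work. A minimal sketch with the old mapred API; expensiveStep() is a placeholder for the real per-record work:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class SlowMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            for (int i = 0; i < 1000; i++) {
                expensiveStep(value, i);  // placeholder for the real work
                reporter.progress();      // tell the tasktracker we are still alive
            }
        }
        private void expensiveStep(Text value, int i) { /* ... */ }
    }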

If a Streaming process hangs, the tasktracker does not try to kill it (although the JVM that launched it will be killed), so you should take precautions to monitor for this scenario, and kill orphaned processes by some other means.

I've actually run into this several times. I wonder whether these are what people call zombie processes?

For some applications it is undesirable to abort the job if a few tasks fail, as it may be possible to use the results of the job despite some failures. In this case, the maximum percentage of tasks that are allowed to fail without triggering job failure can be set for the job. Map tasks and reduce tasks are controlled independently, using the mapred.max.map.failures.percent and mapred.max.reduce.failures.percent properties.

Worth considering setting this when the project goes into production.
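
A sketch of setting this in a job driver; the property names come from the passage, and the 5% figures are arbitrary:

    import org.apache.hadoop.mapred.JobConf;

    public class FailureTolerance {
        // Allow up to 5% of map or reduce tasks to fail without failing the job.
        static void allowSomeFailures(JobConf conf) {
            conf.setInt("mapred.max.map.failures.percent", 5);
            conf.setInt("mapred.max.reduce.failures.percent", 5);
        }
    }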

Failure of a tasktracker is another failure mode. If a tasktracker fails by crashing, or running very slowly, it will stop sending heartbeats to the jobtracker (or send them very infrequently). The jobtracker will notice a tasktracker that has stopped sending heartbeats (if it hasn't received one for 10 minutes, configured via the mapred.tasktracker.expiry.interval property, in milliseconds) and remove it from its pool of tasktrackers to schedule tasks on.

The jobtracker arranges for map tasks that were run and completed successfully on that tasktracker to be rerun if they belong to incomplete jobs, since their intermediate output residing on the failed tasktracker’s local filesystem may not be accessible to the reduce task. Any tasks in progress are also rescheduled.

The Chinese edition of the book seems to translate this a bit loosely. What it means is: as long as the job is incomplete, every map task that ran on the failed tasktracker has to be rerun.

The Fair Scheduler aims to give every user a fair share of the cluster capacity over time. If a single job is running, it gets all of the cluster. As more jobs are submitted, free task slots are given to the jobs in such a way as to give each user a fair share of the cluster. A short job belonging to one user will complete in a reasonable time even while another user’s long job is running, and the long job will still make progress.

The Fair Scheduler is such a good thing, and I'd never used it before. Very well suited for production use on large clusters.

In many ways, the shuffle is the heart of MapReduce, and is where the “magic” happens.

Is it really that impressive?

It is often a good idea to compress the map output as it is written to disk, since doing so makes it faster to write to disk, saves disk space, and reduces the amount of data to transfer to the reducer. By default the output is not compressed, but it is easy to enable by setting mapred.compress.map.output to true. The compression library to use is specified by mapred.map.output.compression.codec; see “Compression” on page 77 for more on compression formats.

Compression does sound like a good idea.
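
For example (a sketch; GzipCodec is just one possible codec choice):

    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapred.JobConf;

    public class MapOutputCompression {
        static void enable(JobConf conf) {
            conf.setBoolean("mapred.compress.map.output", true);
            conf.setClass("mapred.map.output.compression.codec",
                          GzipCodec.class, CompressionCodec.class);
        }
    }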

The output file’s partitions are made available to the reducers over HTTP. The number of worker threads used to serve the file partitions is controlled by the tasktracker.http.threads property—this setting is per tasktracker, not per map task slot. The default of 40 may need increasing for large clusters running large jobs.

That's right, plain HTTP!

Splits vs. partitions

A split is the input to a map task. When the map spills to disk, it sorts in memory by partition; the number of partitions should be tied to the number of reduce tasks. Each spill produces a spill file, and when the map finishes, its spill files are merged. During the copy phase each reducer copies over the data belonging to its own partition, and when the reducer starts executing, the data for that partition from all the maps is merged once more (the sort phase).

Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to.

Furthermore, the reduce task needs the map output for its particular partition from several map tasks across the cluster. The map tasks may finish at different times, so the reduce task starts copying their outputs as soon as each completes. This is known as the copy phase of the reduce task.

The copy phase. Routing map output to a particular partition seems to be done by a hash function.
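
A sketch of that hash logic, mirroring what Hadoop's default HashPartitioner does; the sign-bit mask keeps the result non-negative even for negative hash codes:

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class HashStylePartitioner<K, V> implements Partitioner<K, V> {
        public int getPartition(K key, V value, int numReduceTasks) {
            // Mask off the sign bit so a negative hashCode() cannot
            // produce a negative partition number.
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
        public void configure(JobConf job) { }
    }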

The reduce task has a small number of copier threads so that it can fetch map outputs in parallel. The default is five threads, but this number can be changed by setting the mapred.reduce.parallel.copies property.

This is why reduce progress starts being reported before the maps have finished.

As map tasks complete successfully, they notify their parent tasktracker of the status update, which in turn notifies the jobtracker. These notifications are transmitted over the heartbeat communication mechanism described earlier. Therefore, for a given job, the jobtracker knows the mapping between map outputs and tasktrackers. A thread in the reducer periodically asks the jobtracker for map output locations until it has retrieved them all.

Tasktrackers do not delete map outputs from disk as soon as the first reducer has retrieved them, as the reducer may fail. Instead, they wait until they are told to delete them by the jobtracker, which is after the job has completed.

How is that mapping done? And map output isn't deleted until the whole job completes? Doesn't the intermediate data get frighteningly large?

The configuration property io.sort.factor controls the maximum number of streams to merge at once; the default is 10.

Reportedly this value can be pushed up to 100.

The number of files merged in each round is actually more subtle than this example suggests. The goal is to merge the minimum number of files to get to the merge factor for the final round. So if there were 40 files, the merge would not merge 10 files in each of the four rounds to get 4 files. Instead, the first round would merge only 4 files, and the subsequent three rounds would merge the full 10 files. The 4 merged files, and the 6 (as yet unmerged) files make a total of 10 files for the final round.

The merge details: the aim is to hand the final round exactly ten files. If you merged 10 files in each of three rounds, you'd be left with 10 unmerged files plus 3 big merged ones (13 in total, over the factor), so instead the first round merges only 4 files; the 4 merged files plus the 6 still unmerged make exactly 10, and merging fewer files early also reduces the data written to disk.
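
The same arithmetic as a runnable sketch; the first-round formula is my reconstruction from the 40-file example, not something quoted from the text:

    public class MergeRounds {
        public static void main(String[] args) {
            int files = 40, factor = 10;  // factor plays the role of io.sort.factor
            // First round: merge just enough files that every later round can
            // merge exactly `factor` files and the final round is handed `factor`.
            int first = (files - 1) % (factor - 1) + 1;   // 40 files -> 4
            System.out.println("first round merges " + first + " files");
            int remaining = files - first + 1;            // 37
            while (remaining > factor) {
                remaining = remaining - factor + 1;       // 37 -> 28 -> 19 -> 10
            }
            System.out.println("final round merges " + remaining + " files"); // 10
        }
    }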

using exponential backoff

Like QQ's reconnect-after-timeout behavior: the interval between retries keeps growing.
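
The idea in general form (a sketch, not Hadoop's actual retry code):

    public class Backoff {
        public static void main(String[] args) throws InterruptedException {
            long delayMs = 1000;            // initial wait
            final long maxDelayMs = 60000;  // cap the wait
            for (int attempt = 1; attempt <= 5; attempt++) {
                System.out.println("attempt " + attempt + " failed; sleeping " + delayMs + " ms");
                Thread.sleep(delayMs);
                delayMs = Math.min(delayMs * 2, maxDelayMs);  // double the wait each time
            }
        }
    }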

The general principle is to give the shuffle as much memory as possible. However, there is a trade-off, in that you need to make sure that your map and reduce functions get enough memory to operate. This is why it is best to write your map and reduce functions to use as little memory as possible—certainly they should not use an unbounded amount of memory (by avoiding accumulating values in a map, for example).

Map and reduce functions should use as little memory as possible.

The amount of memory given to the JVMs in which the map and reduce tasks run is set by the mapred.child.java.opts property.

How big is that? (The default is -Xmx200m, i.e. 200 MB.)

More generally, Hadoop uses a buffer size of 4 KB by default, which is low, so you should increase this across the cluster (by setting io.file.buffer.size, see also “Other Hadoop Properties” on page 264).

Must try this when I get a chance!
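
Something like this, though in practice the property belongs in core-site.xml cluster-wide rather than in per-job code (the 64 KB figure is only an illustration):

    import org.apache.hadoop.conf.Configuration;

    public class BufferSize {
        static void raiseIoBuffer(Configuration conf) {
            // Default is 4096 bytes (4 KB); bump it to 64 KB here.
            conf.setInt("io.file.buffer.size", 65536);
        }
    }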

Speculative execution is an optimization, not a feature to make jobs run more reliably. If there are bugs that sometimes cause a task to hang or slow down, then relying on speculative execution to avoid these problems is unwise, and won’t work reliably, since the same bugs are likely to affect the speculative task. You should fix the bug so that the task doesn’t hang or slow down. (The property for reduce tasks is mapred.reduce.tasks.speculative.execution; map tasks have a corresponding mapred.map.tasks.speculative.execution.)

I find this annoying too; turn it off.
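
Turning it off in a job driver, as a sketch using the old JobConf API:

    import org.apache.hadoop.mapred.JobConf;

    public class NoSpeculation {
        static void disable(JobConf conf) {
            // Switch off speculative execution for both task types; the second
            // call is what sets mapred.reduce.tasks.speculative.execution.
            conf.setMapSpeculativeExecution(false);
            conf.setReduceSpeculativeExecution(false);
        }
    }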

The property for controlling task JVM reuse is mapred.job.reuse.jvm.num.tasks: it specifies the maximum number of tasks to run for a given job for each JVM launched, the default is 1 (see Table 6-4). Tasks from different jobs are always run in separate JVMs. If the property is set to –1, there is no limit to the number of tasks from the same job that may share a JVM. The method setNumTasksToExecutePerJvm() on JobConf can also be used to configure this property.

Can a shared JVM reduce memory usage?
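
Enabling unlimited reuse for a job, as a sketch with the setNumTasksToExecutePerJvm() method mentioned above:

    import org.apache.hadoop.mapred.JobConf;

    public class JvmReuse {
        static void enableUnlimitedReuse(JobConf conf) {
            // -1 removes the limit on tasks from the same job sharing one JVM.
            conf.setNumTasksToExecutePerJvm(-1);
        }
    }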

Tasks that are CPU-bound may also benefit from task JVM reuse by taking advantage of runtime optimizations applied by the HotSpot JVM. After running for a while, the HotSpot JVM builds up enough information to detect performance-critical sections in the code, and dynamically translates the Java byte codes of these hot spots into native machine code. This works well for long-running processes, but JVMs that run for seconds or a few minutes may not gain the full benefit of HotSpot. In these cases, it is worth enabling task JVM reuse.

Another place where a shared JVM is useful is for sharing state between the tasks of a job. By storing reference data in a static field, tasks get rapid access to the shared data.

With a shared JVM, CPU-bound programs have an advantage.
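
A sketch of the shared-static-state pattern from the passage above; the lookup-table loading is a placeholder, and the trick only pays off once JVM reuse is on, since a fresh JVM starts with an empty static field:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class SharedStateMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        // Loaded once per JVM; later tasks in the same JVM reuse it.
        private static Map<String, String> lookup;

        public void configure(JobConf job) {
            synchronized (SharedStateMapper.class) {
                if (lookup == null) {
                    lookup = new HashMap<String, String>();
                    // placeholder: load reference data, e.g. from the distributed cache
                }
            }
        }

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            String hit = lookup.get(value.toString());
            if (hit != null) {
                output.collect(value, new Text(hit));
            }
        }
    }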

Skipping mode is off by default; you enable it independently for map and reduce tasks using the SkipBadRecords class. It’s important to note that skipping mode can detect only one bad record per task attempt, so this mechanism is appropriate only for detecting occasional bad records (a few per task, say). You may need to increase the maximum number of task attempts (via mapred.map.max.attempts and mapred.reduce.max.attempts) to give skipping mode enough attempts to detect and skip all the bad records in an input split.

How skipping mode is enabled. For running jobs, this option matters a lot.
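
A sketch of switching it on via SkipBadRecords; the attempt counts are arbitrary, and the non-zero skip values are what enable the mode:

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SkipBadRecords;

    public class SkippingSetup {
        static void enableSkipping(JobConf conf) {
            // Non-zero values switch skipping mode on.
            SkipBadRecords.setMapperMaxSkipRecords(conf, 1);
            SkipBadRecords.setReducerMaxSkipGroups(conf, 1);
            // Extra attempts give skipping mode room to find every bad record.
            conf.setMaxMapAttempts(10);     // mapred.map.max.attempts
            conf.setMaxReduceAttempts(10);  // mapred.reduce.max.attempts
        }
    }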
