
Major Changes in Hadoop 3.x (Compared with Hadoop 2.x)

www.MyException.Cn  Shared by a user on 2018-03-12
 

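        Someone asked me today what the main changes in Hadoop 3.x are, so I went through the official documentation (http://hadoop.apache.org/docs/r3.0.0/index.html) and put together the following brief, translated summary: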

  • 1. Requires JDK >= 1.8 (the minimum Java version was raised from Java 7 to Java 8)
  • 2. HDFS supports erasure coding

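        Compared with replication, erasure coding is a more space-efficient way to store data durably. Standard encodings such as Reed-Solomon (10,4) have a 1.4x space overhead, whereas standard HDFS replication has a 3x overhead. Because erasure coding adds overhead during reconstruction and relies mostly on remote reads, it has traditionally been used for cold, infrequently accessed data. Users should weigh its network and CPU costs when deploying this feature. For more on HDFS erasure coding, see http://hadoop.apache.org/docs/r3.0.0-beta1/hadoop-project-dist/hadoop-hdfs/HDFSErasureCoding.html.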
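The overhead figures quoted above follow from simple arithmetic. The sketch below is only an illustration of that arithmetic; the function names are made up for this example and are not part of any Hadoop API.

```python
def ec_storage_overhead(data_units: int, parity_units: int) -> float:
    """Raw bytes stored per byte of user data for a (data, parity) erasure code."""
    return (data_units + parity_units) / data_units


def replication_storage_overhead(replicas: int) -> float:
    """Raw bytes stored per byte of user data when every block is copied `replicas` times."""
    return float(replicas)


if __name__ == "__main__":
    rs = ec_storage_overhead(10, 4)          # Reed-Solomon (10,4): 14 units stored per 10 data units
    rep = replication_storage_overhead(3)    # standard 3-way HDFS replication
    user_data_tb = 100                       # hypothetical amount of user data
    print(f"RS(10,4): {rs:.1f}x, 3-way replication: {rep:.1f}x")
    print(f"Raw storage for {user_data_tb} TB: {user_data_tb * rs:.0f} TB vs {user_data_tb * rep:.0f} TB")
```

The space saving is only half of the picture: the reconstruction and remote-read costs mentioned above are not modelled here.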
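  • 3. YARN Timeline Service updated to v.2

        This release introduces YARN Timeline Service v.2, which addresses two major challenges: improving the scalability and reliability of the Timeline Service, and enhancing usability by introducing flows and aggregation.

YARN Timeline Service v.2 alpha 2 is provided so that users and developers can test it and give feedback, with the goal of making it a ready replacement for Timeline Service v.1.x. It should be used only in test environments. For more on YARN Timeline Service v.2, see http://hadoop.apache.org/docs/r3.0.0-beta1/hadoop-yarn/hadoop-yarn-site/TimelineServiceV2.html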

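  • 4. Shell scripts rewritten; for example, all scripts now build on hadoop-env.sh

        The Hadoop shell scripts have been rewritten to fix many long-standing bugs and to add some new features. Most behavior remains compatible, but some changes may break existing installations. The incompatible changes are discussed in HADOOP-9902. For more information, see the Unix Shell Guide. Even experienced users are advised to read it, since it describes much of the new functionality, particularly around extensibility.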

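  • 5. Shaded client jars: the new hadoop-client-api and hadoop-client-runtime artifacts use Maven shading to pack Hadoop's dependencies into a single jar

        In Hadoop 2.x, the hadoop-client Maven artifact pulls all of Hadoop's transitive dependencies onto the Hadoop application's classpath. This can cause conflicts when the application depends on different versions of the same classes. HADOOP-11804 resolves this problem.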

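  • 6. Support for opportunistic containers and distributed scheduling; for example, a container can be dispatched to a node even when no resources are available at scheduling time

        A new Opportunistic container type has been introduced. Containers of this type can use resources on a node that have been allocated but are not actually in use. The original container type is now called Guaranteed. Opportunistic containers have lower priority than Guaranteed containers.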

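  • 7. MapReduce native optimization

        MapReduce now has a native implementation of the map output collector. For shuffle-intensive jobs, this can improve performance by 30% or more. For more information, see MAPREDUCE-2841.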

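  • 8. Support for more than two NameNodes

        The initial implementation of HDFS NameNode high availability provided a single active NameNode and a single standby NameNode. By replicating the edit log to a quorum of three JournalNodes, that architecture can tolerate the failure of any one node in the system. Some deployments, however, need a higher degree of fault tolerance. This new feature provides it by letting users run multiple standby NameNodes. For example, with three NameNodes and five JournalNodes, the cluster can tolerate the failure of two nodes rather than just one. The HDFS high-availability documentation has been updated with instructions for configuring more than two NameNodes.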
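The fault-tolerance numbers above can be reproduced with a small calculation. This is a simplified model written for this summary, not Hadoop code: it assumes JournalNodes need a strict majority to accept edits and that the cluster needs at least one NameNode left to serve as active.

```python
def journal_failures_tolerated(journal_nodes: int) -> int:
    """A majority quorum of n nodes survives the loss of floor((n - 1) / 2) of them."""
    return (journal_nodes - 1) // 2


def namenode_failures_tolerated(namenodes: int) -> int:
    """Assumption of this sketch: one surviving NameNode is enough to take over as active."""
    return namenodes - 1


def node_failures_tolerated(namenodes: int, journal_nodes: int) -> int:
    """Worst case: all failures hit whichever group is weaker."""
    return min(namenode_failures_tolerated(namenodes),
               journal_failures_tolerated(journal_nodes))


if __name__ == "__main__":
    for nn, jn in [(2, 3), (3, 5)]:
        print(f"{nn} NameNodes + {jn} JournalNodes: "
              f"tolerates {node_failures_tolerated(nn, jn)} node failure(s)")
```

Under these assumptions, adding a third NameNode without also growing the JournalNode quorum to five would not raise the overall tolerance.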

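  • 9. Default ports of several services have changed

        Previously, the default ports of several Hadoop services fell within the Linux ephemeral port range (32768-61000). As a result, a service could fail to start because its port was already taken by another application. These conflicting ports have now been moved out of the ephemeral range; the change affects the NameNode, Secondary NameNode, DataNode, and KMS. The official documentation has been updated accordingly. For details, see HDFS-9427 and HADOOP-12811.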

       
Service              Port            Hadoop 2.x default   Hadoop 3.x default
NameNode             RPC             8020                 9820
NameNode             HTTP web UI     50070                9870
NameNode             HTTPS web UI    50470                9871
Secondary NameNode   HTTPS web UI    50091                9869
Secondary NameNode   HTTP web UI     50090                9868
DataNode             IPC             50020                9867
DataNode             data transfer   50010                9866
DataNode             HTTP web UI     50075                9864
DataNode             HTTPS web UI    50475                9865
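As a quick sanity check on the port list, the sketch below tests each old and new default against the ephemeral range quoted above (32768-61000). The dictionary simply restates the ports from this post; the names `PORT_CHANGES` and `in_ephemeral_range` are made up for the example.

```python
# Ephemeral range as quoted in the text; the actual range on a given host may differ.
EPHEMERAL_RANGE = range(32768, 61000 + 1)

# (Hadoop 2.x default, Hadoop 3.x default), copied from the list above.
PORT_CHANGES = {
    "NameNode RPC": (8020, 9820),
    "NameNode HTTP": (50070, 9870),
    "NameNode HTTPS": (50470, 9871),
    "Secondary NameNode HTTPS": (50091, 9869),
    "Secondary NameNode HTTP": (50090, 9868),
    "DataNode IPC": (50020, 9867),
    "DataNode data transfer": (50010, 9866),
    "DataNode HTTP": (50075, 9864),
    "DataNode HTTPS": (50475, 9865),
}


def in_ephemeral_range(port: int) -> bool:
    """True if the port could be handed out by the OS as an ephemeral port."""
    return port in EPHEMERAL_RANGE


if __name__ == "__main__":
    for service, (old, new) in PORT_CHANGES.items():
        print(f"{service:26} {old:>5} (ephemeral: {in_ephemeral_range(old)}) -> "
              f"{new:>5} (ephemeral: {in_ephemeral_range(new)})")
```

Every new default falls below the range. Of the old defaults listed here, only the NameNode RPC port 8020 was already outside it.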

 

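  • 10. Support for Microsoft Azure Data Lake and Aliyun Object Storage System
  • 11. New intra-DataNode balancer

        A single DataNode can manage multiple disks. During normal writes, the disks fill up evenly. However, adding or replacing disks can leave the data within that DataNode badly skewed across its disks. The existing HDFS balancer cannot handle this situation. It is handled by the new intra-DataNode balancing functionality, which is invoked through the hdfs diskbalancer CLI. For more information, see the HDFS Commands Guide.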

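  • 12. Reworked heap management for daemons and tasks

        A series of changes has been made to heap management for Hadoop daemons and MapReduce tasks.

HADOOP-10950: introduces new methods for configuring daemon heap sizes. Heap size can now be auto-tuned based on the host's memory size, and HADOOP_HEAPSIZE is deprecated.
MAPREDUCE-5785: simplifies the configuration of map and reduce task heap sizes, so the desired heap size no longer has to be specified in both the task configuration and a Java option. Existing configurations that already specify both are not affected by this change.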
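  • 13. DynamoDB-backed metadata store for the S3 filesystem client

        HADOOP-13345 adds an optional feature to the S3A client for Amazon S3 storage: the ability to use a DynamoDB table as a fast, consistent store for file and directory metadata.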

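  • 14. HDFS supports Router-Based Federation

        HDFS Router-Based Federation adds an RPC routing layer that provides a federated view of multiple HDFS namespaces. It is similar to the existing ViewFs and HDFS Federation functionality, except that the mount table is maintained on the server side by the routing layer rather than on the client. This simplifies access to a federated cluster for existing HDFS clients. For details, see HDFS-10467.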

  • 15. REST API for changing Capacity Scheduler configuration

        The OrgQueue extension to the capacity scheduler provides a REST API for changing queue configuration programmatically. This lets administrators listed in a queue's administer_queue ACL automate queue configuration management. For details, see YARN-5734.

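  • 16. Beyond the traditional CPU and memory, YARN resources can now include user-defined resource types such as GPUs

        The YARN resource model has been generalized to support user-defined countable resource types beyond CPU and memory. For example, a cluster administrator can define resources such as GPUs, software licenses, or locally-attached storage. YARN tasks can then be scheduled based on the availability of these resources. For details, see YARN-3926.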

Apache Hadoop 3.0.0

Apache Hadoop 3.0.0 incorporates a number of significant enhancements over the previous major release line (hadoop-2.x).

This release is generally available (GA), meaning that it represents a point of API stability and quality that we consider production-ready.

Overview

Users are encouraged to read the full set of release notes. This page provides an overview of the major changes.

Minimum required Java version increased from Java 7 to Java 8

All Hadoop JARs are now compiled targeting a runtime version of Java 8. Users still using Java 7 or below must upgrade to Java 8.

Support for erasure coding in HDFS

Erasure coding is a method for durably storing data with significant space savings compared to replication. Standard encodings like Reed-Solomon (10,4) have a 1.4x space overhead, compared to the 3x overhead of standard HDFS replication.

Since erasure coding imposes additional overhead during reconstruction and performs mostly remote reads, it has traditionally been used for storing colder, less frequently accessed data. Users should consider the network and CPU overheads of erasure coding when deploying this feature.

More details are available in the HDFS Erasure Coding documentation.

YARN Timeline Service v.2

We are introducing an early preview (alpha 2) of a major revision of YARN Timeline Service: v.2. YARN Timeline Service v.2 addresses two major challenges: improving scalability and reliability of Timeline Service, and enhancing usability by introducing flows and aggregation.

YARN Timeline Service v.2 alpha 2 is provided so that users and developers can test it and provide feedback and suggestions for making it a ready replacement for Timeline Service v.1.x. It should be used only in a test capacity.

More details are available in the YARN Timeline Service v.2 documentation.

Shell script rewrite

The Hadoop shell scripts have been rewritten to fix many long-standing bugs and include some new features. While an eye has been kept towards compatibility, some changes may break existing installations.

Incompatible changes are documented in the release notes, with related discussion on HADOOP-9902.

More details are available in the Unix Shell Guide documentation. Power users will also be pleased by the Unix Shell API documentation, which describes much of the new functionality, particularly related to extensibility.

Shaded client jars

The hadoop-client Maven artifact available in 2.x releases pulls Hadoop’s transitive dependencies onto a Hadoop application’s classpath. This can be problematic if the versions of these transitive dependencies conflict with the versions used by the application.

HADOOP-11804 adds new hadoop-client-api and hadoop-client-runtime artifacts that shade Hadoop’s dependencies into a single jar. This avoids leaking Hadoop’s dependencies onto the application’s classpath.

Support for Opportunistic Containers and Distributed Scheduling

A notion of ExecutionType has been introduced, whereby Applications can now request containers with an execution type of Opportunistic. Containers of this type can be dispatched for execution at an NM even if there are no resources available at the moment of scheduling. In such a case, these containers will be queued at the NM, waiting for resources to become available before they start. Opportunistic containers are of lower priority than the default Guaranteed containers and are therefore preempted, if needed, to make room for Guaranteed containers. This should improve cluster utilization.

Opportunistic containers are by default allocated by the central RM, but support has also been added to allow opportunistic containers to be allocated by a distributed scheduler which is implemented as an AMRMProtocol interceptor.
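The queueing and preemption behaviour described above can be pictured with a toy model. Everything in this sketch (the `ToyNodeManager` class, its memory-only accounting, the choice of preemption victim) is a hypothetical simplification for illustration and does not mirror the real NodeManager implementation.

```python
from collections import deque
from dataclasses import dataclass, field

GUARANTEED = "GUARANTEED"
OPPORTUNISTIC = "OPPORTUNISTIC"


@dataclass
class Container:
    name: str
    memory_mb: int
    execution_type: str


@dataclass
class ToyNodeManager:
    capacity_mb: int
    running: list = field(default_factory=list)
    queued: deque = field(default_factory=deque)    # opportunistic containers waiting for room
    preempted: list = field(default_factory=list)   # opportunistic containers that were evicted

    def free_mb(self) -> int:
        return self.capacity_mb - sum(c.memory_mb for c in self.running)

    def start(self, container: Container) -> None:
        if container.execution_type == OPPORTUNISTIC:
            # Opportunistic: run if there is room, otherwise wait in the node's queue.
            if self.free_mb() >= container.memory_mb:
                self.running.append(container)
            else:
                self.queued.append(container)
            return
        # Guaranteed: evict opportunistic containers until the new container fits.
        reclaimable = sum(c.memory_mb for c in self.running
                          if c.execution_type == OPPORTUNISTIC)
        if self.free_mb() + reclaimable < container.memory_mb:
            raise RuntimeError("not enough memory for a Guaranteed container")
        while self.free_mb() < container.memory_mb:
            victim = [c for c in self.running if c.execution_type == OPPORTUNISTIC][-1]
            self.running.remove(victim)
            self.preempted.append(victim)
        self.running.append(container)

    def finish(self, name: str) -> None:
        self.running = [c for c in self.running if c.name != name]
        # Start queued opportunistic containers in FIFO order while the head of the queue fits.
        while self.queued and self.free_mb() >= self.queued[0].memory_mb:
            self.running.append(self.queued.popleft())


if __name__ == "__main__":
    nm = ToyNodeManager(capacity_mb=4096)
    nm.start(Container("g1", 2048, GUARANTEED))
    nm.start(Container("o1", 2048, OPPORTUNISTIC))   # fits, runs immediately
    nm.start(Container("o2", 1024, OPPORTUNISTIC))   # node is full, so it is queued
    nm.start(Container("g2", 1024, GUARANTEED))      # preempts o1 to make room
    nm.finish("g1")                                  # frees memory, o2 leaves the queue
    print([c.name for c in nm.running], [c.name for c in nm.preempted])
```

The point of the model is the asymmetry: an Opportunistic request never fails or evicts anything, it just waits, while a Guaranteed request is served at the expense of running Opportunistic work.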

Please see documentation for more details.

MapReduce task-level native optimization

MapReduce has added support for a native implementation of the map output collector. For shuffle-intensive jobs, this can lead to a performance improvement of 30% or more.

See the release notes for MAPREDUCE-2841 for more detail.

Support for more than 2 NameNodes

The initial implementation of HDFS NameNode high-availability provided for a single active NameNode and a single Standby NameNode. By replicating edits to a quorum of three JournalNodes, this architecture is able to tolerate the failure of any one node in the system.

However, some deployments require higher degrees of fault-tolerance. This is enabled by this new feature, which allows users to run multiple standby NameNodes. For instance, by configuring three NameNodes and five JournalNodes, the cluster is able to tolerate the failure of two nodes rather than just one.

The HDFS high-availability documentation has been updated with instructions on how to configure more than two NameNodes.

Default ports of multiple services have been changed

Previously, the default ports of multiple Hadoop services were in the Linux ephemeral port range (32768-61000). This meant that at startup, services would sometimes fail to bind to the port due to a conflict with another application.

These conflicting ports have been moved out of the ephemeral range, affecting the NameNode, Secondary NameNode, DataNode, and KMS. Our documentation has been updated appropriately, but see the release notes for HDFS-9427 and HADOOP-12811 for a list of port changes.

Support for Microsoft Azure Data Lake and Aliyun Object Storage System filesystem connectors

Hadoop now supports integration with Microsoft Azure Data Lake and Aliyun Object Storage System as alternative Hadoop-compatible filesystems.

Intra-datanode balancer

A single DataNode manages multiple disks. During normal write operation, disks will be filled up evenly. However, adding or replacing disks can lead to significant skew within a DataNode. This situation is not handled by the existing HDFS balancer, which concerns itself with inter-, not intra-, DN skew.

This situation is handled by the new intra-DataNode balancing functionality, which is invoked via the hdfs diskbalancer CLI. See the disk balancer section in the HDFS Commands Guide for more information.
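To make "skew within a DataNode" concrete, here is a minimal sketch of one way to quantify it: compare each disk's used fraction with the used fraction of the node as a whole. The metric and the function names are assumptions made for this illustration; the post does not say how hdfs diskbalancer measures skew, and this is not meant to reproduce it.

```python
def volume_deviation(volumes: dict) -> dict:
    """Map each volume to (node-wide used fraction) - (its own used fraction).

    `volumes` maps a volume name to a (used_bytes, capacity_bytes) pair.
    A positive value means the volume is emptier than the node average.
    """
    total_used = sum(used for used, _ in volumes.values())
    total_capacity = sum(capacity for _, capacity in volumes.values())
    ideal = total_used / total_capacity
    return {name: ideal - used / capacity
            for name, (used, capacity) in volumes.items()}


def node_skew(volumes: dict) -> float:
    """Sum of absolute deviations; 0.0 means every disk is equally full."""
    return sum(abs(d) for d in volume_deviation(volumes).values())


if __name__ == "__main__":
    # Hypothetical DataNode: two well-used disks plus one freshly added, empty disk.
    disks = {"disk1": (800, 1000), "disk2": (800, 1000), "disk3": (0, 1000)}
    for name, deviation in volume_deviation(disks).items():
        print(f"{name}: {deviation:+.3f}")
    print(f"node skew: {node_skew(disks):.3f}")
```

A balancer working inside the node would move blocks from the disks with negative deviation to the one with positive deviation until the skew is close to zero.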

Reworked daemon and task heap management

A series of changes have been made to heap management for Hadoop daemons as well as MapReduce tasks.

HADOOP-10950 introduces new methods for configuring daemon heap sizes. Notably, auto-tuning is now possible based on the memory size of the host, and the HADOOP_HEAPSIZE variable has been deprecated. See the full release notes of HADOOP-10950 for more detail.

MAPREDUCE-5785 simplifies the configuration of map and reduce task heap sizes, so the desired heap size no longer needs to be specified in both the task configuration and as a Java option. Existing configs that already specify both are not affected by this change. See the full release notes of MAPREDUCE-5785 for more details.

S3Guard: Consistency and Metadata Caching for the S3A filesystem client

HADOOP-13345 adds an optional feature to the S3A client of Amazon S3 storage: the ability to use a DynamoDB table as a fast and consistent store of file and directory metadata.

See S3Guard for more details.

HDFS Router-Based Federation

HDFS Router-Based Federation adds an RPC routing layer that provides a federated view of multiple HDFS namespaces. This is similar to the existing ViewFs and HDFS Federation functionality, except the mount table is managed on the server-side by the routing layer rather than on the client. This simplifies access to a federated cluster for existing HDFS clients.
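The role of a mount table can be illustrated with a longest-prefix lookup: the routing layer maps a client path to a namespace and a path inside that namespace. The table contents, namespace names, and the `resolve` function below are hypothetical and only sketch the idea; they are not the Router's actual data structures.

```python
# Hypothetical mount table: mount point -> (namespace id, target path in that namespace).
MOUNT_TABLE = {
    "/": ("ns0", "/"),
    "/data": ("ns1", "/data"),
    "/data/logs": ("ns2", "/logs"),
}


def resolve(path: str, mount_table: dict) -> tuple:
    """Return (namespace, path within that namespace) using the longest matching mount point."""
    best = None
    for mount_point in mount_table:
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            if best is None or len(mount_point) > len(best):
                best = mount_point
    if best is None:
        raise LookupError(f"no mount point covers {path}")
    namespace, target = mount_table[best]
    remainder = path[len(best):].lstrip("/")
    resolved = target.rstrip("/") + "/" + remainder if remainder else target
    return namespace, resolved


if __name__ == "__main__":
    for p in ["/data/logs/app.log", "/data/raw/part-0", "/tmp/x", "/database"]:
        print(p, "->", resolve(p, MOUNT_TABLE))
```

Because this lookup happens in the routing layer, a client only needs to know one address, which is the simplification for existing HDFS clients that the paragraph above describes.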

See HDFS-10467 and the HDFS Router-based Federation documentation for more details.

API-based configuration of Capacity Scheduler queue configuration

The OrgQueue extension to the capacity scheduler provides a programmatic way to change configurations by providing a REST API that users can call to modify queue configurations. This enables automation of queue configuration management by administrators in the queue’s administer_queue ACL.

See YARN-5734 and the Capacity Scheduler documentation for more information.

 

YARN Resource Types

The YARN resource model has been generalized to support user-defined countable resource types beyond CPU and memory. For instance, the cluster administrator could define resources like GPUs, software licenses, or locally-attached storage. YARN tasks can then be scheduled based on the availability of these resources.
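Scheduling on arbitrary countable resources reduces to checking every requested resource against what a node has left. The following sketch shows that check with plain dictionaries; the resource names, node names, and functions are illustrative assumptions and not YARN's API.

```python
def fits(request: dict, available: dict) -> bool:
    """A request fits if the node has at least the requested amount of every resource."""
    return all(available.get(name, 0) >= amount for name, amount in request.items())


def pick_node(request: dict, nodes: dict):
    """Return the first node that can satisfy the request, or None if no node can."""
    for node_name, available in nodes.items():
        if fits(request, available):
            return node_name
    return None


if __name__ == "__main__":
    # Hypothetical cluster: only node-b has the user-defined "gpu" resource.
    nodes = {
        "node-a": {"memory_mb": 8192, "vcores": 8},
        "node-b": {"memory_mb": 8192, "vcores": 8, "gpu": 2},
    }
    cpu_task = {"memory_mb": 2048, "vcores": 1}
    gpu_task = {"memory_mb": 2048, "vcores": 1, "gpu": 1}
    print(pick_node(cpu_task, nodes), pick_node(gpu_task, nodes))
```

Treating CPU and memory as just two entries among many is what lets new resource types be added without changing the fitting logic.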

See YARN-3926 and the YARN resource model documentation for more information.
