HADOOP Optimization (7): Hadoop Comprehensive Tuning (2) Enterprise Development Scenario Case Study
Published: 2021-09-05
3.1 Requirements
(1) Requirement: count the occurrences of each word in 1 GB of data. Three servers, each with 4 GB of memory and a 4-core, 4-thread CPU.
(2) Requirement analysis:
1 GB / 128 MB = 8 MapTasks; 1 ReduceTask; 1 MRAppMaster
On average: 10 tasks / 3 nodes ≈ 3 tasks per node (distributed as 4, 3, 3)
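The arithmetic above can be sketched as a quick shell calculation (values taken from the requirement; the ceiling division is an assumption about how input splits round up):

```shell
# Estimate task counts for the scenario above.
INPUT_MB=1024    # 1 GB of input
BLOCK_MB=128     # default HDFS block size
MAP_TASKS=$(( (INPUT_MB + BLOCK_MB - 1) / BLOCK_MB ))   # ceiling division
TOTAL_TASKS=$(( MAP_TASKS + 1 + 1 ))                    # + 1 ReduceTask + 1 MRAppMaster
echo "MapTasks=${MAP_TASKS}, total tasks=${TOTAL_TASKS}, ~$(( (TOTAL_TASKS + 2) / 3 )) per node"
```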
3.2 HDFS Parameter Tuning
(1) Modify hadoop-env.sh
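The original snippet is not shown here; as a hypothetical example of the kind of edit this step makes, the fragment below caps the NameNode and DataNode daemon heaps to fit the 4 GB nodes (the 1024m values and logger settings are assumptions, not taken from this post):

```shell
# Hypothetical hadoop-env.sh fragment: cap HDFS daemon heaps on small (4 GB) nodes.
export HDFS_NAMENODE_OPTS="-Dhadoop.security.logger=INFO,RFAS -Xmx1024m"
export HDFS_DATANODE_OPTS="-Dhadoop.security.logger=ERROR,RFAS -Xmx1024m"
```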
(2) Modify core-site.xml
<!-- Set the trash retention interval to 60 minutes -->
<property>
    <name>fs.trash.interval</name>
    <value>60</value>
</property>
(3) Distribute the configuration
[atguigu@hadoop102 hadoop]$ xsync hadoop-env.sh hdfs-site.xml core-site.xml
3.3 MapReduce Parameter Tuning
(1) Modify mapred-site.xml
<!-- Circular (sort) buffer size, default 100 MB -->
<property>
    <name>mapreduce.task.io.sort.mb</name>
    <value>100</value>
</property>

<!-- Circular buffer spill threshold, default 0.8 -->
<property>
    <name>mapreduce.map.sort.spill.percent</name>
    <value>0.80</value>
</property>

<!-- Number of streams merged at once, default 10 -->
<property>
    <name>mapreduce.task.io.sort.factor</name>
    <value>10</value>
</property>

<!-- MapTask memory, default 1 GB; the MapTask heap size defaults to match this value (mapreduce.map.java.opts) -->
<property>
    <name>mapreduce.map.memory.mb</name>
    <value>-1</value>
    <description>The amount of memory to request from the scheduler for each map task. If this is not specified or is non-positive, it is inferred from mapreduce.map.java.opts and mapreduce.job.heap.memory-mb.ratio. If java-opts are also not specified, we set it to 1024.</description>
</property>

<!-- MapTask CPU vcores, default 1 -->
<property>
    <name>mapreduce.map.cpu.vcores</name>
    <value>1</value>
</property>

<!-- MapTask retry attempts on failure, default 4 -->
<property>
    <name>mapreduce.map.maxattempts</name>
    <value>4</value>
</property>

<!-- Number of parallel copies each Reduce uses to fetch map output, default 5 -->
<property>
    <name>mapreduce.reduce.shuffle.parallelcopies</name>
    <value>5</value>
</property>

<!-- Fraction of Reduce memory used for the shuffle buffer, default 0.7 -->
<property>
    <name>mapreduce.reduce.shuffle.input.buffer.percent</name>
    <value>0.70</value>
</property>

<!-- Buffer fill fraction at which writing to disk begins, default 0.66 -->
<property>
    <name>mapreduce.reduce.shuffle.merge.percent</name>
    <value>0.66</value>
</property>

<!-- ReduceTask memory, default 1 GB; the ReduceTask heap size defaults to match this value (mapreduce.reduce.java.opts) -->
<property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>-1</value>
    <description>The amount of memory to request from the scheduler for each reduce task. If this is not specified or is non-positive, it is inferred from mapreduce.reduce.java.opts and mapreduce.job.heap.memory-mb.ratio. If java-opts are also not specified, we set it to 1024.</description>
</property>

<!-- ReduceTask CPU vcores, default 1; changed to 2 -->
<property>
    <name>mapreduce.reduce.cpu.vcores</name>
    <value>2</value>
</property>

<!-- ReduceTask retry attempts on failure, default 4 -->
<property>
    <name>mapreduce.reduce.maxattempts</name>
    <value>4</value>
</property>

<!-- Fraction of MapTasks that must complete before resources are requested for ReduceTasks, default 0.05 -->
<property>
    <name>mapreduce.job.reduce.slowstart.completedmaps</name>
    <value>0.05</value>
</property>

<!-- A task that reads no data within this timeout (default 10 minutes) is forcibly terminated -->
<property>
    <name>mapreduce.task.timeout</name>
    <value>600000</value>
</property>
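As a rough sketch of how the `-1` memory settings above resolve, assuming the fallback of 1024 MB described in the property descriptions and the default `mapreduce.job.heap.memory-mb.ratio` of 0.8 (an assumption about the default, not stated in this post):

```shell
# Sketch: relate task container memory to the derived JVM heap size.
CONTAINER_MB=1024   # fallback when memory.mb and java.opts are both unset
RATIO_PCT=80        # mapreduce.job.heap.memory-mb.ratio (0.8) as a percentage
HEAP_MB=$(( CONTAINER_MB * RATIO_PCT / 100 ))
echo "container=${CONTAINER_MB}MB derived heap≈${HEAP_MB}MB"
```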
(2) Distribute the configuration
[atguigu@hadoop102 hadoop]$ xsync mapred-site.xml
3.4 YARN Parameter Tuning
(1) Modify yarn-site.xml as follows:
<!-- Choose the scheduler; the default is the Capacity Scheduler -->
<property>
    <description>The class to use as the resource scheduler.</description>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>

<!-- Threads the ResourceManager uses to handle scheduler requests, default 50. Increase it when more than 50 jobs are submitted, but never beyond 3 nodes * 4 threads = 12 threads (in practice no more than 8, leaving room for other processes) -->
<property>
    <description>Number of threads to handle scheduler interface.</description>
    <name>yarn.resourcemanager.scheduler.client.thread-count</name>
    <value>8</value>
</property>

<!-- Whether YARN auto-detects hardware for its configuration, default false. If the node runs many other applications, configure manually; if not, auto-detection is acceptable -->
<property>
    <description>Enable auto-detection of node capabilities such as memory and CPU.</description>
    <name>yarn.nodemanager.resource.detect-hardware-capabilities</name>
    <value>false</value>
</property>

<!-- Whether to count logical processors (hyperthreads) as cores, default false (physical cores are used) -->
<property>
    <description>Flag to determine if logical processors(such as hyperthreads) should be counted as cores. Only applicable on Linux when yarn.nodemanager.resource.cpu-vcores is set to -1 and yarn.nodemanager.resource.detect-hardware-capabilities is true.</description>
    <name>yarn.nodemanager.resource.count-logical-processors-as-cores</name>
    <value>false</value>
</property>

<!-- Multiplier from physical cores to vcores, default 1.0 -->
<property>
    <description>Multiplier to determine how to convert phyiscal cores to vcores. This value is used if yarn.nodemanager.resource.cpu-vcores is set to -1(which implies auto-calculate vcores) and yarn.nodemanager.resource.detect-hardware-capabilities is set to true. The number of vcores will be calculated as number of CPUs * multiplier.</description>
    <name>yarn.nodemanager.resource.pcores-vcores-multiplier</name>
    <value>1.0</value>
</property>

<!-- Memory available to the NodeManager, default 8 GB; changed to 4 GB -->
<property>
    <description>Amount of physical memory, in MB, that can be allocated for containers. If set to -1 and yarn.nodemanager.resource.detect-hardware-capabilities is true, it is automatically calculated(in case of Windows and Linux). In other cases, the default is 8192MB.</description>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>4096</value>
</property>

<!-- NodeManager CPU vcores, default 8 when not auto-detected from hardware; changed to 4 -->
<property>
    <description>Number of vcores that can be allocated for containers. This is used by the RM scheduler when allocating resources for containers. This is not used to limit the number of CPUs used by YARN containers. If it is set to -1 and yarn.nodemanager.resource.detect-hardware-capabilities is true, it is automatically determined from the hardware in case of Windows and Linux. In other cases, number of vcores is 8 by default.</description>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>4</value>
</property>

<!-- Minimum container memory, default 1 GB -->
<property>
    <description>The minimum allocation for every container request at the RM in MBs. Memory requests lower than this will be set to the value of this property. Additionally, a node manager that is configured to have less memory than this value will be shut down by the resource manager.</description>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1024</value>
</property>

<!-- Maximum container memory, default 8 GB; changed to 2 GB -->
<property>
    <description>The maximum allocation for every container request at the RM in MBs. Memory requests higher than this will throw an InvalidResourceRequestException.</description>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>2048</value>
</property>

<!-- Minimum container CPU vcores, default 1 -->
<property>
    <description>The minimum allocation for every container request at the RM in terms of virtual CPU cores. Requests lower than this will be set to the value of this property. Additionally, a node manager that is configured to have fewer virtual cores than this value will be shut down by the resource manager.</description>
    <name>yarn.scheduler.minimum-allocation-vcores</name>
    <value>1</value>
</property>

<!-- Maximum container CPU vcores, default 4; changed to 2 -->
<property>
    <description>The maximum allocation for every container request at the RM in terms of virtual CPU cores. Requests higher than this will throw an InvalidResourceRequestException.</description>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>2</value>
</property>

<!-- Virtual-memory check, enabled by default; changed to disabled -->
<property>
    <description>Whether virtual memory limits will be enforced for containers.</description>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>

<!-- Ratio of virtual memory to physical memory, default 2.1 -->
<property>
    <description>Ratio between virtual memory to physical memory when setting memory limits for containers. Container allocations are expressed in terms of physical memory, and virtual memory usage is allowed to exceed this allocation by this ratio.</description>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>2.1</value>
</property>
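A quick sanity check on the limits above, as pure arithmetic over the values in this section (no cluster needed):

```shell
# Sanity-check the YARN limits configured above.
NM_MEM_MB=4096          # yarn.nodemanager.resource.memory-mb
MAX_ALLOC_MB=2048       # yarn.scheduler.maximum-allocation-mb
MIN_ALLOC_MB=1024       # yarn.scheduler.minimum-allocation-mb
VMEM_RATIO_TENTHS=21    # yarn.nodemanager.vmem-pmem-ratio (2.1), as tenths for integer math

# Max concurrent maximum-size containers per NodeManager:
MAX_CONTAINERS=$(( NM_MEM_MB / MAX_ALLOC_MB ))
# Virtual-memory ceiling for a minimum (1 GB) container, relevant only when the vmem check is enabled:
VMEM_MB=$(( MIN_ALLOC_MB * VMEM_RATIO_TENTHS / 10 ))
echo "max-size containers per NM=${MAX_CONTAINERS}, vmem limit for 1 GB container=${VMEM_MB}MB"
```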
(2) Distribute the configuration
[atguigu@hadoop102 hadoop]$ xsync yarn-site.xml
3.5 Running the Program
(1) Restart the cluster
[atguigu@hadoop103 hadoop-3.1.3]$ sbin/stop-yarn.sh
[atguigu@hadoop103 hadoop-3.1.3]$ sbin/start-yarn.sh
(2) Run the WordCount program
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount /input /output
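This is not the Hadoop job itself, but the counting that wordcount performs can be sketched locally with coreutils, which is handy for checking expectations before running on the cluster (the sample file path /tmp/wc_sample.txt is a hypothetical example):

```shell
# Local sketch of what the wordcount example computes: split on whitespace, count each word.
printf 'hello world\nhello hadoop\n' > /tmp/wc_sample.txt
tr -s ' ' '\n' < /tmp/wc_sample.txt | sort | uniq -c
# counts: hadoop=1, hello=2, world=1
```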
(3) Observe the YARN job page
http://hadoop103:8088/cluster/apps
This post is from cnblogs, by 秋華. When reposting, please cite the original link: https://www.cnblogs.com/qiu-hua/p/15229384.html