Big Data (12): Custom OutputFormat and ReduceJoin Merging (Data Skew)

阿新 • Published: 2018-11-10

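I. The OutputFormat Interface

OutputFormat is the base type for MapReduce output; every MapReduce output implementation derives from it. (It is an interface in the old mapred API and an abstract class in the mapreduce API used in this post.)

1. Text output: TextOutputFormat

The default output format is TextOutputFormat, which writes each record as one line of text. Its keys and values can be of any type, because it calls toString() on them to turn them into strings; key and value are separated by a tab by default.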
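As a rough illustration of what that conversion produces, the sketch below mimics in plain Java (no Hadoop dependency) how one text line can be assembled from an arbitrary key and value: toString() on each, joined by a tab, with a missing side simply left out. The names TextLineSketch and formatRecord are made up for this example and are not part of Hadoop.

```java
class TextLineSketch {
    /** Builds one output line from a key and a value of any type. */
    public static String formatRecord(Object key, Object value) {
        boolean hasKey = key != null;
        boolean hasValue = value != null;
        StringBuilder line = new StringBuilder();
        if (hasKey) {
            line.append(key.toString());
        }
        if (hasKey && hasValue) {
            line.append('\t'); // tab separator between key and value
        }
        if (hasValue) {
            line.append(value.toString());
        }
        return line.append('\n').toString();
    }

    public static void main(String[] args) {
        // A key with an Integer value, then a key with no value at all
        System.out.print(formatRecord("1001", 3));
        System.out.print(formatRecord("http://www.baidu.com", null));
    }
}
```

In a real job the "no value" case corresponds to using NullWritable as the value type, which is what the filter example later in this post does.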

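2. SequenceFileOutputFormat

SequenceFileOutputFormat writes its output as a sequence file. It is a good choice when the output will serve as the input of a subsequent MapReduce job, because the format is compact and compresses well.

3. Custom OutputFormat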

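II. Custom OutputFormat

A custom OutputFormat lets you control the path of the final output files.

When a single MapReduce program has to write two kinds of results to different directories depending on the data, that kind of flexible output is implemented with a custom OutputFormat.

1. Steps for a custom OutputFormat

Define a class that extends FileOutputFormat.
Provide your own RecordWriter and override write(), the method that writes out the data.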

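III. Filtering text content and customizing the output file path (custom OutputFormat)

1. Requirement

Check whether each line of the input log contains .com:

Sites ending in com are written to d:/com.log
Sites not ending in com are written to d:/other.log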
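The requirement is phrased both as "contains .com" and as "ends in com", and the two tests disagree on some URLs. The sketch below, in plain Java, compares a loose substring test with a stricter check on the host name. It assumes each log line is a full URL with a scheme such as http://; ComFilterSketch and its method names are hypothetical, introduced only for this illustration.

```java
import java.net.URI;

class ComFilterSketch {
    /** Loose test: true when ".com" appears anywhere in the line. */
    public static boolean containsDotCom(String line) {
        return line.contains(".com");
    }

    /** Stricter test: true only when the URL's host name ends with ".com". */
    public static boolean hostEndsWithCom(String line) {
        try {
            String host = URI.create(line.trim()).getHost();
            return host != null && host.endsWith(".com");
        } catch (IllegalArgumentException e) {
            return false; // the line is not a parseable URI
        }
    }

    /** Picks the output file from this section for one log line. */
    public static String targetFile(String line) {
        return hostEndsWithCom(line) ? "d:/com.log" : "d:/other.log";
    }

    public static void main(String[] args) {
        String[] lines = {"http://www.baidu.com", "http://www.sina.com.cn", "http://www.sohu.net"};
        for (String line : lines) {
            System.out.println(line + " contains=" + containsDotCom(line)
                    + " hostEnds=" + hostEndsWithCom(line) + " -> " + targetFile(line));
        }
    }
}
```

A host such as www.sina.com.cn passes the substring test but not the host test, so which predicate to use depends on what the log is meant to separate.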

2. Input data

A log file named log.txt that contains multiple URLs

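3. Custom OutputFormat

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class FilterOutputFormat extends FileOutputFormat<Text,NullWritable>{
    @Override
    public RecordWriter<Text, NullWritable> getRecordWriter(TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        return new FilterRecordWriter(taskAttemptContext);
    }
}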

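4. Custom RecordWriter

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class FilterRecordWriter extends RecordWriter<Text, NullWritable> {
    private FSDataOutputStream comFs = null;
    private FSDataOutputStream otherFs = null;

    public FilterRecordWriter(TaskAttemptContext job) throws IOException {
        // Get the file system from the job configuration.
        // A failure here is propagated instead of being swallowed, so write()
        // never runs against streams that were not opened.
        Configuration configuration = job.getConfiguration();
        FileSystem fileSystem = FileSystem.get(configuration);
        // Create the two output streams
        comFs = fileSystem.create(new Path("d:/com.log"));
        otherFs = fileSystem.create(new Path("d:/other.log"));
    }

    @Override
    public void write(Text text, NullWritable nullWritable) throws IOException, InterruptedException {
        // Route the record depending on whether it contains ".com"
        FSDataOutputStream out = text.toString().contains(".com") ? comFs : otherFs;
        // Text.getBytes() returns the backing array, which can be longer than
        // the valid data, so write only the first getLength() bytes
        out.write(text.getBytes(), 0, text.getLength());
    }

    @Override
    public void close(TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        // Close both streams
        if (comFs != null) {
            comFs.close();
        }
        if (otherFs != null) {
            otherFs.close();
        }
    }
}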

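5. Writing the Mapper

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FilterMapper extends Mapper<LongWritable,Text,Text,NullWritable>{
    Text k = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Read one line
        String line = value.toString();
        // Use the whole line as the key
        k.set(line);
        // Emit it
        context.write(k,NullWritable.get());
    }
}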

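6. Writing the Reducer

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class FilterReducer extends Reducer<Text,NullWritable,Text,NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
        // The custom RecordWriter writes raw bytes, so append the line break here
        String k = key.toString() + "\r\n";
        context.write(new Text(k),NullWritable.get());
    }
}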

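7. Writing the Driver

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FilterDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // Get the configuration and create the job
        Configuration conf=new Configuration();
        Job job = Job.getInstance(conf);
        // Set the class used to locate the jar
        job.setJarByClass(FilterDriver.class);
        // Set the map and reduce classes
        job.setMapperClass(FilterMapper.class);
        job.setReducerClass(FilterReducer.class);
        // Set the custom OutputFormat
        job.setOutputFormatClass(FilterOutputFormat.class);
        // Set the key and value types of the map output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);
        // Set the key and value types of the final output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        // Set the input and output paths. The output path is still required:
        // FileOutputFormat checks it and writes the _SUCCESS marker there
        FileInputFormat.setInputPaths(job,new Path(args[0]));
        FileOutputFormat.setOutputPath(job,new Path(args[1]));
        // Submit the job and wait for it to finish
        boolean result = job.waitForCompletion(true);
        System.exit(result?0:1);
    }
}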

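IV. ReduceJoin

1. How it works

Map side: tag each key/value pair coming from the different tables (files) so that records from different sources can be told apart. Then emit the join field as the key, with the remaining fields plus the new tag as the value.

Reduce side: by the time reduce runs, grouping by the join field is already done. Within each group it only remains to separate the records by source file and then merge them.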
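The two phases can be imitated in plain Java to make the data flow concrete. This is a minimal in-memory sketch, not Hadoop code: a TreeMap stands in for the sorted shuffle, the class and method names (ReduceJoinSketch, Tagged, map, reduce, join) are invented for the illustration, and the rows are assumed to be tab-separated with the product id in column 2 of the order file and column 1 of the product file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ReduceJoinSketch {
    /** A row tagged with the table it came from. */
    static final class Tagged {
        final String source; // "order" or "product"
        final String[] fields;

        Tagged(String source, String[] fields) {
            this.source = source;
            this.fields = fields;
        }
    }

    /** Map side: tag the row and emit it under the join key (the product id). */
    static void map(String source, String row, Map<String, List<Tagged>> shuffle) {
        String[] f = row.split("\t");
        String pid = source.equals("order") ? f[1] : f[0];
        shuffle.computeIfAbsent(pid, k -> new ArrayList<>()).add(new Tagged(source, f));
    }

    /** Reduce side: inside one key group, split by source, then merge. */
    static List<String> reduce(List<Tagged> group) {
        String pname = "";
        List<String[]> orders = new ArrayList<>();
        for (Tagged t : group) {
            if (t.source.equals("product")) {
                pname = t.fields[1];
            } else {
                orders.add(t.fields);
            }
        }
        List<String> out = new ArrayList<>();
        for (String[] o : orders) {
            out.add(o[0] + "\t" + pname + "\t" + o[2]);
        }
        return out;
    }

    /** Runs both phases over two small in-memory tables. */
    public static List<String> join(List<String> orderRows, List<String> productRows) {
        Map<String, List<Tagged>> shuffle = new TreeMap<>();
        for (String r : orderRows) {
            map("order", r, shuffle);
        }
        for (String r : productRows) {
            map("product", r, shuffle);
        }
        List<String> result = new ArrayList<>();
        for (List<Tagged> group : shuffle.values()) {
            result.addAll(reduce(group));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> joined = join(
                Arrays.asList("1001\t01\t1", "1002\t02\t2", "1003\t03\t3"),
                Arrays.asList("01\tXiaomi", "02\tHuawei", "03\tGree"));
        joined.forEach(System.out::println);
    }
}
```

Every order row for a given product id ends up in the same group, so a product with far more orders than the others produces one oversized group. That is the data-skew risk the title refers to.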

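2. Drawback

The obvious drawback of this approach is that it moves a large amount of data through the shuffle phase, which makes it inefficient. When a few join keys carry most of the records, the reducers that receive those keys also become a bottleneck (data skew).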

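V. Multi-table merge example in MapReduce (data skew)

1. Requirement

Order table t_order (file name: order.txt)

id	pid	amount
1001	01	1
1002	02	2
1003	03	3

Product table t_product (file name: pd.txt)

pid	pname
01	小米
02	華為
03	格力

Merge the data of the product table into the order table by product pid.

Final result:

id	pname	amount
1001	小米	1
1002	華為	2
1003	格力	3

2. Program analysis

Mapper logic:
1. Get the name of the input file to tell which table it is
2. Read the input line
3. Handle each file separately
4. Wrap the fields in a bean object and emit it
The shuffle sorts by product id (the map output key) by default.
The reduce method caches the order records and the product record of each key, then merges them.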

3. Writing the Bean

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class TableBean implements Writable {
    /**
    * Order id
    */
    private String orderId;
    /**
    * Product id
    */
    private String pid;
    /**
    * Product quantity
    */
    private int amount;
    /**
    * Product name
    */
    private String pName;
    /**
    * Marks the source table: order table (0) or product table (1)
    */
    private String flag;

    // Serialization: the field order must match readFields() exactly
    @Override
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeUTF(orderId);
        dataOutput.writeUTF(pid);
        dataOutput.writeInt(amount);
        dataOutput.writeUTF(pName);
        dataOutput.writeUTF(flag);
    }

    // Deserialization: read the fields in the order they were written
    @Override
    public void readFields(DataInput dataInput) throws IOException {
        this.orderId = dataInput.readUTF();
        this.pid = dataInput.readUTF();
        this.amount = dataInput.readInt();
        this.pName = dataInput.readUTF();
        this.flag = dataInput.readUTF();
    }

    // Tab-separated output line: order id, product name, amount
    @Override
    public String toString() {
        return orderId + "\t" + pName + "\t" + amount;
    }

    public String getOrderId() {
        return orderId;
    }

    public void setOrderId(String orderId) {
        this.orderId = orderId;
    }

    public String getPid() {
        return pid;
    }

    public void setPid(String pid) {
        this.pid = pid;
    }

    public int getAmount() {
        return amount;
    }

    public void setAmount(int amount) {
        this.amount = amount;
    }

    public String getpName() {
        return pName;
    }

    public void setpName(String pName) {
        this.pName = pName;
    }

    public String getFlag() {
        return flag;
    }

    public void setFlag(String flag) {
        this.flag = flag;
    }
}
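The write()/readFields() pair relies on two properties of java.io's DataOutput and DataInput: fields must be read back in exactly the order they were written, and writeUTF() rejects null strings. The following plain-Java sketch shows both with in-memory streams. WritableRoundTripSketch and its methods are hypothetical helpers written for this illustration, not part of the bean or of Hadoop.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class WritableRoundTripSketch {
    /** Serializes five fields in the same order as the bean's write(). */
    public static byte[] write(String orderId, String pid, int amount, String pName, String flag) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(orderId);
            out.writeUTF(pid);
            out.writeInt(amount);
            out.writeUTF(pName);
            out.writeUTF(flag);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the fields back in the same order and joins them with tabs. */
    public static String read(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            String orderId = in.readUTF();
            String pid = in.readUTF();
            int amount = in.readInt();
            String pName = in.readUTF();
            String flag = in.readUTF();
            return orderId + "\t" + pid + "\t" + amount + "\t" + pName + "\t" + flag;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns true when serializing a null string field fails. */
    public static boolean nullFieldFails() {
        try {
            write(null, "01", 0, "Xiaomi", "1");
            return false;
        } catch (NullPointerException e) {
            return true; // writeUTF(null) throws NullPointerException
        }
    }

    public static void main(String[] args) {
        System.out.println(read(write("1001", "01", 1, "", "0")));
        System.out.println("null field fails: " + nullFieldFails());
    }
}
```

This is why every String field of the bean needs a non-null value before it is emitted, with an empty string for fields that do not apply to a given table.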

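4. Writing the Mapper

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class TableMapper extends Mapper<LongWritable, Text, Text, TableBean> {
    TableBean v = new TableBean();
    Text k = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Tell the two tables apart by the name of the input file
        FileSplit split = (FileSplit) context.getInputSplit();
        String name = split.getPath().getName();

        // Read one line
        String line = value.toString();
        // Split it into fields
        String[] fields = line.split("\t");
        if (name.startsWith("order")) {
            // Order table: id, pid, amount
            v.setOrderId(fields[0]);
            v.setPid(fields[1]);
            v.setAmount(Integer.parseInt(fields[2]));
            // Empty string rather than null, because writeUTF() rejects null
            v.setpName("");
            v.setFlag("0");

            k.set(fields[1]);
        } else {
            // Product table: pid, pname
            v.setOrderId("");
            v.setPid(fields[0]);
            v.setAmount(0);
            v.setpName(fields[1]);
            v.setFlag("1");
            k.set(fields[0]);
        }
        // Emit with the product id as the join key
        context.write(k, v);
    }
}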

5. Writing the Reducer

import java.io.IOException;
import java.lang.reflect.InvocationTargetException;
import java.util.ArrayList;
import java.util.List;
import org.apache.commons.beanutils.BeanUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TableReducer extends Reducer<Text, TableBean, TableBean, NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<TableBean> values, Context context) throws IOException, InterruptedException {
        // Holders for the records of this product id
        List<TableBean> orderBeans = new ArrayList<>();
        TableBean pdBean = new TableBean();

        // Hadoop reuses the same value object on every iteration,
        // so each record has to be copied before it is kept
        for (TableBean value : values) {
            try {
                if ("0".equals(value.getFlag())) {
                    // Order table
                    TableBean tableBean = new TableBean();
                    BeanUtils.copyProperties(tableBean, value);
                    orderBeans.add(tableBean);
                } else {
                    // Product table
                    BeanUtils.copyProperties(pdBean, value);
                }
            } catch (IllegalAccessException | InvocationTargetException e) {
                // Fail the task instead of silently continuing with bad data
                throw new IOException(e);
            }
        }
        // Join: fill in the product name on every order record
        for (TableBean orderBean : orderBeans) {
            orderBean.setpName(pdBean.getpName());
            // Emit
            context.write(orderBean, NullWritable.get());
        }
    }
}
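The reducer copies every value before storing it because Hadoop hands the reduce method the same value object on each iteration, refilled with the next record. The plain-Java sketch below reproduces that behavior with a hand-written iterator to show what goes wrong without the copy. ValueReuseSketch, Bean, and reusingValues are stand-ins invented for this example; they only imitate the reuse and do not call any Hadoop API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class ValueReuseSketch {
    /** Minimal mutable record standing in for the bean. */
    static final class Bean {
        String orderId;

        Bean() {
        }

        Bean(Bean other) {
            this.orderId = other.orderId; // copy constructor
        }
    }

    /** Iterable that refills ONE shared Bean on every step. */
    static Iterable<Bean> reusingValues(List<String> orderIds) {
        Bean shared = new Bean();
        return () -> new Iterator<Bean>() {
            private int i = 0;

            @Override
            public boolean hasNext() {
                return i < orderIds.size();
            }

            @Override
            public Bean next() {
                shared.orderId = orderIds.get(i++);
                return shared;
            }
        };
    }

    /** Keeps the iterated references as they are: every entry aliases one object. */
    public static List<String> collectWithoutCopy(List<String> orderIds) {
        List<Bean> kept = new ArrayList<>();
        for (Bean b : reusingValues(orderIds)) {
            kept.add(b);
        }
        return toIds(kept);
    }

    /** Copies each record before keeping it. */
    public static List<String> collectWithCopy(List<String> orderIds) {
        List<Bean> kept = new ArrayList<>();
        for (Bean b : reusingValues(orderIds)) {
            kept.add(new Bean(b));
        }
        return toIds(kept);
    }

    private static List<String> toIds(List<Bean> beans) {
        List<String> ids = new ArrayList<>();
        for (Bean b : beans) {
            ids.add(b.orderId);
        }
        return ids;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("1001", "1002", "1003");
        System.out.println(collectWithoutCopy(input)); // [1003, 1003, 1003]
        System.out.println(collectWithCopy(input));    // [1001, 1002, 1003]
    }
}
```

A copy constructor, as used here, is one alternative to reflection-based copying; either way the point is that the list must hold distinct objects.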

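6. Writing the Driver

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TableDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // Get the configuration and create the job instance
        Configuration entries = new Configuration();
        Job job = Job.getInstance(entries);
        // Set the class used to locate the program's jar
        job.setJarByClass(TableDriver.class);
        // Set the Mapper and Reducer the job uses
        job.setMapperClass(TableMapper.class);
        job.setReducerClass(TableReducer.class);
        // Set the map output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(TableBean.class);
        // Set the final output types
        job.setOutputKeyClass(TableBean.class);
        job.setOutputValueClass(NullWritable.class);
        // Set the input directory (holding order.txt and pd.txt) and the output path
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Run the job
        boolean flag = job.waitForCompletion(true);
        System.exit(flag ? 0 : 1);
    }
}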
