Big Data Processing Algorithms: Bloom Filter

阿新 • Published: 2019-01-11

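I. Definition of the Bloom Filter

A Bloom filter tests whether an element belongs to a set. It is an extension of hashing: underneath it is just a bit array in which each bit can represent one number, so a Bloom filter is built on top of a bitmap.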

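II. How a Bloom Filter Works

1. Inserting data

In a bitmap, each bit corresponds to one number: when a number appears, the bit at that position is set to 1. A Bloom filter differs in that it must handle not only integers but also other types such as strings. A string has to be hashed to a bit position first, and with a large number of strings, mapping each one to a single bit inevitably causes many collisions. The Bloom filter's answer is to represent each string with several bits whose positions are produced by different hash functions, which lowers the probability of a collision.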
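The insertion step can be sketched as follows. This is a minimal illustration rather than the article's implementation: `seeded_hash`, `bloom_insert`, and the three seeds are hypothetical names and values chosen for the example, and a `std::vector<bool>` stands in for the bitmap.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical BKDR-style polynomial hash. Passing a different seed gives a
// different hash function (an illustrative choice, not the article's set).
inline std::size_t seeded_hash(const std::string& s, std::size_t seed) {
    std::size_t h = 0;
    for (unsigned char c : s) h = h * seed + c;
    return h;
}

// Set one bit per hash function for `key` and return the positions that
// were set, so the caller can see which bits represent the key.
inline std::vector<std::size_t> bloom_insert(std::vector<bool>& bits,
                                             const std::string& key) {
    assert(!bits.empty());
    const std::size_t seeds[] = {31, 131, 1313};  // three hash functions
    std::vector<std::size_t> positions;
    for (std::size_t seed : seeds) {
        std::size_t pos = seeded_hash(key, seed) % bits.size();
        bits[pos] = true;
        positions.push_back(pos);
    }
    return positions;
}
```

Calling `bloom_insert(bits, "alice@example.com")` on a 1024-bit vector sets up to three bits; two of the positions may occasionally coincide, in which case fewer distinct bits are set.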

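2. Deleting data

A plain Bloom filter does not support deletion. Each element is represented by several bits, and the bits of different elements inevitably overlap, so clearing a bit for one element may also affect other elements. How can deletion be supported, then? One option is reference counting: widen the bit array into an integer array, where each index is a position produced by a hash function and each entry stores how many times that position has been hit, i.e. its reference count. Inserting increments the counters, and deleting decrements them.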
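The reference-counting idea can be sketched like this. It is an assumption-laden illustration, not code from the article: the class name `CountingBloom`, the helper `seeded_hash`, the seeds in `kSeeds`, and the 32-bit counter width are all choices made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical seeded polynomial hash standing in for the k hash functions.
inline std::size_t seeded_hash(const std::string& s, std::size_t seed) {
    std::size_t h = 0;
    for (unsigned char c : s) h = h * seed + c;
    return h;
}

const std::size_t kSeeds[] = {31, 131, 1313};  // illustrative seeds

class CountingBloom {
public:
    explicit CountingBloom(std::size_t slots) : counts_(slots, 0) {
        assert(slots > 0);
    }

    // Insert: increment the counter at every hashed position.
    void insert(const std::string& key) {
        for (std::size_t seed : kSeeds) ++counts_[slot(key, seed)];
    }

    // Lookup: a zero counter at any position means "definitely absent".
    bool may_contain(const std::string& key) const {
        for (std::size_t seed : kSeeds)
            if (counts_[slot(key, seed)] == 0) return false;
        return true;
    }

    // Delete: decrement only if the key may be present, so counters never
    // underflow. Returns false when the key is definitely absent.
    bool remove(const std::string& key) {
        if (!may_contain(key)) return false;
        for (std::size_t seed : kSeeds) --counts_[slot(key, seed)];
        return true;
    }

private:
    std::size_t slot(const std::string& key, std::size_t seed) const {
        return seeded_hash(key, seed) % counts_.size();
    }

    std::vector<std::uint32_t> counts_;  // one reference count per position
};
```

The trade-off is space: each position now costs a whole counter instead of one bit. Removing a key that was never inserted but happens to test as present would still corrupt the counts, so callers should only remove keys they know were inserted.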

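3. Looking up data

As with insertion, the element is run through the same hash functions to obtain several bit positions. If every one of those bits is set in the bit array, the element may be present. Only "may", because those bits could have been set to 1 by other elements. If any one of the bits is 0, the element is definitely not present.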
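The asymmetry between "definitely absent" and "possibly present" can be seen in a deliberately tiny sketch. The names `TinyBloom`, `seeded_hash`, and `kSeeds` are hypothetical, and the 8-bit size is chosen only to make a false positive easy to trigger.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical seeded polynomial hash standing in for the k hash functions.
// For a one-character key the result is just the character code.
inline std::size_t seeded_hash(const std::string& s, std::size_t seed) {
    std::size_t h = 0;
    for (unsigned char c : s) h = h * seed + c;
    return h;
}

const std::size_t kSeeds[] = {31, 131, 1313};  // illustrative seeds

struct TinyBloom {
    std::vector<bool> bits;

    explicit TinyBloom(std::size_t m) : bits(m, false) { assert(m > 0); }

    void insert(const std::string& key) {
        for (std::size_t seed : kSeeds)
            bits[seeded_hash(key, seed) % bits.size()] = true;
    }

    // false: the key was never inserted. true: it may have been inserted,
    // or other keys may have set all of its bits.
    bool may_contain(const std::string& key) const {
        for (std::size_t seed : kSeeds)
            if (!bits[seeded_hash(key, seed) % bits.size()]) return false;
        return true;
    }
};
```

With only 8 bits, inserting "a" sets bit 1 (97 mod 8). The key "i" (105 mod 8 = 1) then tests as possibly present although it was never inserted, while "b" (bit 2) is correctly reported as absent.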

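III. Characteristics of the Bloom Filter

1. Advantages: it uses far less space than structures such as hash tables or red-black trees, because it stores only bits rather than the elements themselves. Insertion and lookup each cost a fixed number of hash computations and bit operations, no matter how many elements have been inserted.

2. Disadvantages: deletion is not supported, and query results are not always accurate (a "not present" answer is always correct, but a "present" answer may be a false positive).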

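IV. Applications of the Bloom Filter

Public email providers such as NetEase and QQ constantly need to filter out junk mail sent by spammers.

One approach is to record the email addresses that send spam. Because spammers keep registering new addresses, there are at least several billion spam addresses worldwide, and storing them all would take a large number of servers.

With a hash table, every 100 million email addresses require 1.6 GB of memory. (A concrete way to implement this is to map each email address to an 8-byte fingerprint and store the fingerprints in the hash table. Since a hash table's storage efficiency is typically only about 50%, each address ends up occupying 16 bytes, so 100 million addresses take about 1.6 GB, i.e. 1.6 billion bytes.) Storing several billion addresses could therefore require on the order of a hundred GB of memory.

A Bloom filter needs only 1/8 to 1/4 of the hash table's size to solve the same problem.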
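The arithmetic above can be checked with a short sketch. The false-positive estimate is not from the article: it uses the standard approximation (1 - e^(-k/b))^k for b bits per element and k ideal hash functions, and the function names and the values of k used below are illustrative assumptions.

```cpp
#include <cassert>
#include <cmath>

// Memory for n fingerprints of the given size in a hash table whose slots
// are only partly used (load = fraction of slots actually occupied).
inline double hash_table_bytes(double n, double fingerprint_bytes, double load) {
    assert(load > 0.0 && load <= 1.0);
    return n * fingerprint_bytes / load;
}

// Bits available per element when total_bytes are shared by n elements.
inline double bits_per_element(double total_bytes, double n) {
    assert(n > 0.0);
    return total_bytes * 8.0 / n;
}

// Standard approximation of the false-positive rate for a Bloom filter with
// b bits per element and k hash functions, assuming ideal hashing.
inline double false_positive_rate(double b, int k) {
    assert(b > 0.0 && k > 0);
    return std::pow(1.0 - std::exp(-k / b), k);
}
```

For 100 million addresses, `hash_table_bytes(1e8, 8.0, 0.5)` gives 1.6e9 bytes, matching the 1.6 GB figure. One eighth of that is 16 bits per address and one quarter is 32 bits per address. Under the stated assumptions, with 11 and 22 hash functions respectively, the estimated false-positive rates come out below 0.1% and below 0.0001%.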

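A Bloom filter never misses a suspicious address that is on the blacklist. As for false positives, a common remedy is to additionally maintain a small whitelist of addresses that are likely to be misjudged.

V. Implementing the Bloom Filter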

#include <iostream>
#include <string>
#include "bitmap.h"
using namespace std;

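// BKDR hash (multiplier 131) in 32-bit arithmetic, starting from 1.
// The result is masked to 31 bits.
struct HashFunc1
{
	static size_t BKDRHash(const char *str)
	{
		unsigned int seed = 131;
		unsigned int hash = 1;
		while (*str)
		{
			// Go through unsigned char so bytes >= 0x80 do not sign-extend.
			hash = hash * seed + static_cast<unsigned char>(*str++);
		}

		return (hash & 0x7fffffff);
	}

	size_t operator()(const std::string &str) const
	{
		return BKDRHash(str.c_str());
	}
};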
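// BKDR hash (multiplier 131) over the full width of size_t, starting from 0.
// It shares its multiplier with HashFunc1, so the two are closely related
// rather than independent.
struct HashFunc2
{
	static size_t BKDRHash(const char *str)
	{
		size_t hash = 0;
		while (size_t ch = static_cast<unsigned char>(*str++))
		{
			hash = hash * 131 + ch;
		}
		return hash;
	}

	size_t operator()(const std::string &str) const
	{
		return BKDRHash(str.c_str());
	}
};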
// JS hash (Justin Sobel).
struct HashFunc3
{
	static size_t JSHash(const char *str)
	{
		if (!*str)        // Added by the author so that the empty string hashes to 0
			return 0;
		size_t hash = 1315423911;
		while (size_t ch = static_cast<unsigned char>(*str++))
		{
			hash ^= ((hash << 5) + ch + (hash >> 2));
		}
		return hash;
	}

	size_t operator()(const std::string &str) const
	{
		return JSHash(str.c_str());
	}
};
// RS hash (Robert Sedgewick): the multiplier itself changes at every step.
struct HashFunc4
{
	static size_t RSHash(const char *str)
	{
		size_t hash = 0;
		size_t magic = 63689;
		while (size_t ch = static_cast<unsigned char>(*str++))
		{
			hash = hash * magic + ch;
			magic *= 378551;
		}
		return hash;
	}

	size_t operator()(const std::string &str) const
	{
		return RSHash(str.c_str());
	}
};
// AP hash (Arash Partow): alternates two mixing steps on even and odd positions.
struct HashFunc5
{
	static size_t APHash(const char *str)
	{
		size_t hash = 0;
		size_t ch;
		for (long i = 0; (ch = static_cast<unsigned char>(*str++)) != 0; i++)
		{
			if ((i & 1) == 0)
			{
				hash ^= ((hash << 7) ^ ch ^ (hash >> 3));
			}
			else
			{
				hash ^= (~((hash << 11) ^ ch ^ (hash >> 5)));
			}
		}
		return hash;
	}

	size_t operator()(const std::string &str) const
	{
		return APHash(str.c_str());
	}
};
// K is the key type; Hash1..Hash5 are functors that map a key to a size_t.
template<class K = std::string,
class Hash1 = HashFunc1,
class Hash2 = HashFunc2,
class Hash3 = HashFunc3,
class Hash4 = HashFunc4,
class Hash5 = HashFunc5>
class Bloom
{
public:
	// size is the number of bits in the underlying bitmap and must be nonzero.
	Bloom(size_t size)
		:_map(size)
	{}

	// Insert: set the bit chosen by each of the five hash functions.
	void Set(const K &key)
	{
		_map.Set(Hash1()(key) % _map.Size());
		_map.Set(Hash2()(key) % _map.Size());
		_map.Set(Hash3()(key) % _map.Size());
		_map.Set(Hash4()(key) % _map.Size());
		_map.Set(Hash5()(key) % _map.Size());
	}

	// Intentionally empty: a plain Bloom filter cannot delete, because
	// clearing a bit could also remove other keys that share it.
	void Unset()
	{
	}

	// Returns false if key is definitely absent, true if it may be present.
	bool Test(const K &key)
	{
		return _map.test(Hash1()(key) % _map.Size())
			&& _map.test(Hash2()(key) % _map.Size())
			&& _map.test(Hash3()(key) % _map.Size())
			&& _map.test(Hash4()(key) % _map.Size())
			&& _map.test(Hash5()(key) % _map.Size());
	}

private:
	bitmap _map;
};

Below is the bitmap implementation (the bitmap.h included above).

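#pragma once
#include <cstddef>
#include <cstdint>
#include <vector>

// Fixed-size bitmap: bit num lives in word num / 32 at bit offset num % 32.
class bitmap
{
public:
	// size is the number of addressable bits; all bits start cleared.
	bitmap(std::size_t size)
		:map((size >> 5) + 1, 0)
		, _size(size)
	{}

	// Set bit num to 1.
	void Set(std::size_t num)
	{
		std::size_t index = num >> 5;
		std::size_t pos = num % 32;	// offsets run 0..31, so shift by pos directly
		map[index] |= (std::uint32_t(1) << pos);
	}

	// Clear bit num to 0.
	void Unset(std::size_t num)
	{
		std::size_t index = num >> 5;
		std::size_t pos = num % 32;
		map[index] &= ~(std::uint32_t(1) << pos);
	}

	// Return true if bit num is 1.
	bool test(std::size_t num) const
	{
		std::size_t index = num >> 5;
		std::size_t pos = num % 32;
		return ((map[index] >> pos) & 1u) == 1u;
	}

	std::size_t Size() const
	{
		return _size;
	}
private:
	std::vector<std::uint32_t> map;	// owns the storage, so no manual delete[]
	std::size_t _size;
};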
