Scraping Autohome Store Data: DotnetSpider in Practice [1]

阿新 • Published: 2019-01-18

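1. Background

Even over the Spring Festival I could not sit idle. I had long wanted to learn how web crawlers work, but most of what I found online was written for Python, where the community is active and articles are plentiful. After some digging I found that a developer on cnblogs had written an open-source library called DotnetSpider and, happily, it also supports .NET Core. So I used the holiday to study the project and try it on a real target. The online car market is booming (Taoche 淘車, Renrenche 人人車, Yiche 易車, Autohome 汽車之家), so I picked Autohome, specifically the Mango Auto (芒果汽車) store, and scraped its data.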

2. Development environment

Visual Studio 2017 + .NET Core 2.x + DotnetSpider + Windows 10

3. Development

3.1 Create a .NET Core project

Create a new .NET Core console application.

[Screenshot: creating the .NET Core console application]

3.2 Add the DotnetSpider libraries via NuGet

Search for DotnetSpider in the NuGet package manager and add the two libraries shown in the screenshot.

[Screenshot: NuGet search results for DotnetSpider]

3.3 Analyze the page to scrape

Open https://store.mall.autohome.com.cn/83106681.html. The area outlined in red is the information we want to scrape.

[Screenshot: the store page with the car list outlined in red]

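Using the Network tab of Chrome's developer tools, we can capture the request that returns this data. The tab shows everything about the HTTP request, including the headers and the POST parameters. All we really do is simulate that HTTP request and then parse the returned HTML to extract the data.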

[Screenshot: the captured request in the Chrome Network tab]

The page parameter is the page number; changing its value fetches the corresponding page of data.

[Screenshot: the page parameter in the request's form data]

The response is the HTML of the list page.

[Screenshot: the HTML returned by the request]
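Before bringing in DotnetSpider, the idea of this section (replay the captured POST with a different page value) can be sketched with the plain System.Net.Http types. This is only a sketch under stated assumptions, not the author's code: the names ListRequestBuilder, Build and moduleJson are hypothetical, the form fields id, j, page and shopid are taken from the request body shown in section 3.8, and the j value is supplied by the caller rather than reconstructed here. The request is built but not sent.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;

// Hypothetical helper: builds (but does not send) the list-page request.
public static class ListRequestBuilder
{
    // Endpoint and store page URL as they appear in the article.
    private const string Endpoint = "https://store.mall.autohome.com.cn/shop/ajaxsitemodlecontext.jtml";
    private const string StorePage = "https://store.mall.autohome.com.cn/83106681.html";

    public static HttpRequestMessage Build(int page, string moduleJson)
    {
        var request = new HttpRequestMessage(HttpMethod.Post, Endpoint);
        request.Headers.Referrer = new Uri(StorePage);

        // FormUrlEncodedContent percent-encodes each value and sets the
        // Content-Type header to application/x-www-form-urlencoded.
        request.Content = new FormUrlEncodedContent(new[]
        {
            new KeyValuePair<string, string>("id", "7"),
            new KeyValuePair<string, string>("j", moduleJson),
            new KeyValuePair<string, string>("page", page.ToString()),
            new KeyValuePair<string, string>("shopid", "83106681"),
        });
        return request;
    }
}
```

Passing the result to HttpClient.SendAsync and reading the response body as a string would yield the HTML fragment for that page; only the page argument changes between calls.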

3.4 Create the storage entity class AutoHomeShopListEntity

// One car listing scraped from the store's list page.
class AutoHomeShopListEntity : SpiderEntity
{
    public string DetailUrl { get; set; }
    public string CarImg { get; set; }
    public string Price { get; set; }
    public string DelPrice { get; set; }
    public string Title { get; set; }
    public string Tip { get; set; }
    public string BuyNum { get; set; }

    // Compact one-line form used when printing results to the console.
    public override string ToString()
    {
        return $"{Title}|{Price}|{DelPrice}|{BuyNum}";
    }
}

3.5 Create AutoHomeProcessor

The processor parses the returned HTML and stores the result.

private class AutoHomeProcessor : BasePageProcessor
{
    protected override void Handle(Page page)
    {
        var list = new List<AutoHomeShopListEntity>();

        // Each <li class="carbox"> in the list is one car listing.
        var modelHtmlList = page.Selectable.XPath(".//div[@class='list']/ul[@class='fn-clear']/li[@class='carbox']").Nodes();
        foreach (var modelHtml in modelHtmlList)
        {
            var entity = new AutoHomeShopListEntity();
            entity.DetailUrl = modelHtml.XPath(".//a/@href").GetValue();
            entity.CarImg = modelHtml.XPath(".//a/div[@class='carbox-carimg']/img/@src").GetValue();

            // carbox-info contains one or two prices, each prefixed with '¥'.
            // Strip the whitespace, drop the leading '¥', then split on '¥'.
            var price = modelHtml.XPath(".//a/div[@class='carbox-info']")
                .GetValue(DotnetSpider.Core.Selector.ValueOption.InnerText)
                .Trim()
                .Replace(" ", string.Empty)
                .Replace("\n", string.Empty)
                .Replace("\t", string.Empty)
                .TrimStart('¥')
                .Split("¥");

            entity.Price = price[0];
            // When there is only one price, use it for DelPrice as well.
            entity.DelPrice = price.Length > 1 ? price[1] : price[0];

            entity.Title = modelHtml.XPath(".//a/div[@class='carbox-title']").GetValue();
            entity.Tip = modelHtml.XPath(".//a/div[@class='carbox-tip']").GetValue();
            entity.BuyNum = modelHtml.XPath(".//a/div[@class='carbox-number']/span").GetValue();
            list.Add(entity);
        }

        // Hand the parsed list to the pipeline under the key "CarList".
        page.AddResultItem("CarList", list);
    }
}
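The price clean-up inside Handle is the least obvious step, so it may help to see it in isolation. The following is a minimal sketch, assuming the inner text looks like one or two '¥'-prefixed prices separated by whitespace. The helper name PriceText.Split is hypothetical, and the extra "\r" replacement is an addition that the article's code does not have.

```csharp
using System;

// Hypothetical helper that isolates the price clean-up used in Handle.
public static class PriceText
{
    // Returns (Price, DelPrice). When only one price is present, both
    // values are the same, matching the fallback in the article's code.
    public static (string Price, string DelPrice) Split(string innerText)
    {
        var cleaned = (innerText ?? string.Empty)
            .Replace(" ", string.Empty)
            .Replace("\n", string.Empty)
            .Replace("\r", string.Empty) // not in the article; guards against CRLF
            .Replace("\t", string.Empty)
            .TrimStart('¥');

        var parts = cleaned.Split('¥');
        return parts.Length > 1 ? (parts[0], parts[1]) : (parts[0], parts[0]);
    }
}
```

With a made-up input such as " ¥12.98万\n\t¥15.00万 ", the whitespace is removed first, the leading '¥' is trimmed, and the remaining '¥' separates the two prices.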

3.6 Create AutoHomePipe

The pipeline outputs the scraped results.

private class AutoHomePipe : BasePipeline
{
    public override void Process(IEnumerable<ResultItems> resultItems, ISpider spider)
    {
        foreach (var resultItem in resultItems)
        {
            // Cast once; skip the item if it is not the expected list type.
            var cars = resultItem.Results["CarList"] as List<AutoHomeShopListEntity>;
            if (cars == null)
            {
                continue;
            }

            // Print the number of listings on this page, then each listing.
            Console.WriteLine(cars.Count);
            foreach (var item in cars)
            {
                Console.WriteLine(item);
            }
        }
    }
}

3.7 Create the Site

The Site object mainly carries the HTTP header information captured earlier.

var site = new Site
{
    CycleRetryTimes = 1,
    SleepTime = 200,
    Headers = new Dictionary<string, string>()
    {
        { "Accept", "text/html, */*; q=0.01" },
        { "Referer", "https://store.mall.autohome.com.cn/83106681.html" },
        { "Cache-Control", "no-cache" },
        { "Connection", "keep-alive" },
        { "Content-Type", "application/x-www-form-urlencoded; charset=UTF-8" },
        { "User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.167 Safari/537.36" }
    }
};

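3.8 Build the Requests

The endpoint we captured only accepts POST, so each request has to be built explicitly; for a GET endpoint this step could be skipped. The parameters simply go into PostBody.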

List<Request> resList = new List<Request>();

// One POST request per list page; the article crawls pages 1 through 33.
for (int i = 1; i <= 33; i++)
{
    Request res = new Request();
    // Form body copied from the captured request; only the page value changes.
    res.PostBody = $"id=7&j=%7B%22createMan%22%3A%2218273159100%22%2C%22createTime%22%3A1518433690000%2C%22row%22%3A5%2C%22siteUserActivityListId%22%3A8553%2C%22siteUserPageRowModuleId%22%3A84959%2C%22topids%22%3A%22%22%2C%22wherePhase%22%3A%221%22%2C%22wherePreferential%22%3A%220%22%2C%22whereUsertype%22%3A%220%22%7D&page={i}&shopid=83106681";
    res.Url = "https://store.mall.autohome.com.cn/shop/ajaxsitemodlecontext.jtml";
    res.Method = System.Net.Http.HttpMethod.Post;
    resList.Add(res);
}
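The long percent-encoded j value in PostBody is just a URL-encoded JSON object. As a readability aid, here is a minimal sketch that keeps the JSON in plain form and encodes it with Uri.EscapeDataString. The class and method names (PostBodyBuilder, Build) are hypothetical; the JSON content is the decoded form of the string above.

```csharp
using System;

// Hypothetical helper: rebuilds the PostBody from a readable JSON string.
public static class PostBodyBuilder
{
    // Decoded value of the "j" parameter from the article's PostBody.
    private const string ModuleJson =
        @"{""createMan"":""18273159100"",""createTime"":1518433690000,""row"":5," +
        @"""siteUserActivityListId"":8553,""siteUserPageRowModuleId"":84959,""topids"":""""," +
        @"""wherePhase"":""1"",""wherePreferential"":""0"",""whereUsertype"":""0""}";

    public static string Build(int page)
    {
        // Uri.EscapeDataString percent-encodes the braces, quotes,
        // colons and commas, leaving letters and digits untouched.
        return $"id=7&j={Uri.EscapeDataString(ModuleJson)}&page={page}&shopid=83106681";
    }
}
```

Inside the loop, res.PostBody = PostBodyBuilder.Build(i) would then stand in for the hand-copied string, which makes it easier to see which fields the endpoint receives.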

3.9 Build the spider and run it

var spider = Spider.Create(site, new QueueDuplicateRemovedScheduler(), new AutoHomeProcessor())
    .AddStartRequests(resList.ToArray())
    .AddPipeline(new AutoHomePipe());
spider.ThreadNum = 1;
spider.Run();

3.10 Results

[Screenshot: console output of the crawl]

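4. Coming next

Next I will scrape the product detail pages (including vehicle specifications and configuration). I have already captured the endpoint, but I am still working out how to obtain the product id more conveniently: as far as I can tell it is stored in a JavaScript global variable on the page, which makes it awkward to extract.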
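One possible way to read a value out of a JavaScript global, sketched here purely as an illustration, is a regular expression over the raw page HTML. Everything in this sketch is an assumption: the helper name JsGlobalReader.Read is hypothetical, the variable name used in the example (goodsId) is made up, and the real page may declare the id in a form this pattern does not match.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper: pulls the value of a JavaScript global out of page HTML.
public static class JsGlobalReader
{
    // Returns the value assigned in "var <name> = <value>;" (quoted or not),
    // or null when no such declaration is found.
    public static string Read(string html, string name)
    {
        var pattern = @"\bvar\s+" + Regex.Escape(name) + @"\s*=\s*[""']?([^""';\s]+)[""']?\s*;";
        var match = Regex.Match(html, pattern);
        return match.Success ? match.Groups[1].Value : null;
    }
}
```

For a made-up fragment such as <script>var goodsId = "123456";</script>, calling JsGlobalReader.Read(html, "goodsId") captures the text between the quotes. A regular expression like this is brittle compared with a real JavaScript parser, so it is a stopgap at best.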


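5. Summary

.NET feels less active than other language communities. DotnetSpider is still a young project, but I hope people in the cnblogs community will start using it so that it keeps improving and the .NET ecosystem grows along with it.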

Original article: https://www.cnblogs.com/FunnyBoy/p/8453338.html
