Python - Scraping and Logging In to GitHub with a Crawler

阿新 • Published: 2018-10-31

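Use the API to find the ten most-starred repositories on GitHub, then log in with a POST request and star them.

1. Using the API to find the ten most-starred repositories on GitHub

Use the API that GitHub provides to fetch the ten Python repositories with the most stars.

GitHub offers many API endpoints that are well suited to crawlers; they return data in a convenient, easy-to-process form. (The official GitHub site documents the available APIs.)

The URL I will use is https://api.github.com/search/repositories?q=language:python&sort=stars. The value after 'language:' is the language to search for, and the value after 'sort=' is the sort order. This URL returns information on the top thirty Python repositories ranked by star count.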
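The query string in that URL can also be assembled from its parts rather than typed by hand. The following is a minimal sketch using only the standard library. The helper name `build_search_url` is made up for illustration, and `order` and `per_page` are optional parameters of GitHub's search endpoint that the URL above leaves at their defaults.

```python
from urllib.parse import urlencode

API_ROOT = 'https://api.github.com/search/repositories'


def build_search_url(language, sort='stars', order='desc', per_page=30):
    """Return a repository-search URL for the given language.

    `build_search_url` is a hypothetical helper, not part of any library.
    """
    params = {
        'q': 'language:%s' % language,  # search qualifier: language to match
        'sort': sort,                   # field to rank by
        'order': order,                 # 'desc' puts the most-starred first
        'per_page': per_page,           # number of results per page
    }
    # urlencode percent-encodes reserved characters such as ':' in the query
    return API_ROOT + '?' + urlencode(params)


print(build_search_url('python', per_page=10))
```

The resulting string can be passed to requests.get() exactly like the hand-written URL. Asking for per_page=10 keeps the response down to the ten entries that are actually needed.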

Libraries used

import requests

Fetch the page with a GET request

response = requests.get('https://api.github.com/search/repositories?q=language:python&sort=stars')
# Check whether the request succeeded; on success the status code is 200
if response.status_code != 200:
    print('error: fail to request')

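If you open the link above yourself, you will see an unusual page: no interface, just plain text.

Look closely and you will see that the text has the same structure as a dictionary. The top level has three keys; the 'items' key holds a list of dictionaries, each storing the details of one Python repository.

So we can extract the information we need directly.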

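# The result is a dict parsed from the JSON response
j = response.json()
# 'items' holds the details of the top thirty repositories
items = j['items']

# Save the first ten entries
message = []
for i in range(10):
    pro = items[i]
    message.append(pro['full_name'])  # the repository's 'owner/name'
    # Print each one in turn
    print('top%d:' % (i+1), pro['name'])  # print the repository name

Printed result: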

top1: awesome-python
top2: system-design-primer
top3: models
top4: public-apis
top5: youtube-dl
top6: flask
top7: thefuck
top8: httpie
top9: django
top10: awesome-machine-learning
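The loop above indexes items[i] for i up to 9, which raises an IndexError if the search returns fewer than ten results. A slightly more defensive sketch slices the list instead. The payload below is a made-up, heavily abridged stand-in for the real response (the count is a placeholder), and `top_full_names` is a hypothetical helper name.

```python
import json

# Made-up, abridged stand-in for the API response; real entries carry many more fields.
SAMPLE_RESPONSE = '''
{
  "total_count": 3,
  "incomplete_results": false,
  "items": [
    {"name": "awesome-python", "full_name": "vinta/awesome-python"},
    {"name": "flask", "full_name": "pallets/flask"},
    {"name": "thefuck", "full_name": "nvbn/thefuck"}
  ]
}
'''


def top_full_names(payload, n=10):
    """Return the 'owner/name' of at most the first n repositories."""
    # Slicing never raises, even when fewer than n items came back
    return [repo['full_name'] for repo in payload.get('items', [])[:n]]


payload = json.loads(SAMPLE_RESPONSE)
print(sorted(payload))              # the three top-level keys
print(top_full_names(payload, 2))   # first two 'owner/name' strings
print(top_full_names(payload, 10))  # only three exist, so three come back
```

With a live response, `payload` would simply be the value of response.json().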

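Complete code:

import requests

def find_top10(language):
    '''
    Print and return the ten most popular repositories.
    Takes the language to search for (this program passes 'python') and
    returns a list of 'owner/name' strings for the top ten repositories.
    '''
    # Search through the API that GitHub provides
    response = requests.get(
        "https://api.github.com/search/repositories?q=language:%s&sort=stars" % language)

    # Check whether the request succeeded
    if response.status_code != 200:
        print('error! in find_top10(): fail to request')
        return None

    # The result is a dict parsed from the JSON response
    j = response.json()
    # The 'items' key holds the details of the top thirty repositories
    items = j['items']

    # Save the first ten entries (slicing is safe if fewer than ten come back)
    message = []
    for i, pro in enumerate(items[:10]):
        message.append(pro['full_name'])
        # Print each one in turn
        print('top%d:' % (i+1), pro['name'])
    return message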

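2. Logging in to GitHub with a POST request

First, let's work out what the POST needs. We need to know the URL, and we need to know which fields (data) to submit.

Open the GitHub login page, try to log in, and watch the login request in Chrome via right-click -> Inspect -> Network.

This reveals the URL used for login and the fields (data) submitted with it.

However, the data contains an authenticity_token. Zhihu has the same thing; it is an anti-CSRF token that protects form submissions. The key question is where to find its value.

The value of authenticity_token is usually hidden in the preceding page.

Sure enough, the hidden authenticity_token turns up in the HTML source of the login page (remember this technique).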
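A hidden field of this kind is an ordinary input element with type="hidden" inside the form. The sketch below collects every such field from a page. It uses html.parser from the standard library rather than BeautifulSoup so that it runs without extra packages; the markup is a made-up stand-in for the real login page, the token value is a placeholder, and the names `HiddenInputParser` and `hidden_inputs` are hypothetical.

```python
from html.parser import HTMLParser

# Made-up, heavily abridged stand-in for the login page markup.
SAMPLE_LOGIN_HTML = """
<form action="/session" method="post">
  <input type="hidden" name="utf8" value="&#x2713;">
  <input type="hidden" name="authenticity_token" value="EXAMPLE-TOKEN-123">
  <input type="text" name="login">
  <input type="password" name="password">
  <input type="submit" name="commit" value="Sign in">
</form>
"""


class HiddenInputParser(HTMLParser):
    """Collect name -> value for every hidden input on a page."""

    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'input' and attrs.get('type') == 'hidden' and 'name' in attrs:
            self.hidden[attrs['name']] = attrs.get('value', '')


def hidden_inputs(html):
    """Return a dict of all hidden form fields found in the HTML text."""
    parser = HiddenInputParser()
    parser.feed(html)
    return parser.hidden


fields = hidden_inputs(SAMPLE_LOGIN_HTML)
print(fields['authenticity_token'])
```

Listing all hidden fields at once is a quick way to see which values a form expects beyond the ones the user types in.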

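We log in using the information above. (The authenticity_token value has to be fetched by the crawler itself, since the token goes with the session that requested the page; a value copied from the browser will not work!)

Use a GET request to fetch the login page and obtain the cookies and the authenticity_token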

from bs4 import BeautifulSoup

r1 = requests.get('https://github.com/login')

soup = BeautifulSoup(r1.text, features='lxml')
# Get the authenticity_token
att = soup.find(name='input', attrs={'name': 'authenticity_token'}).get('value')
# Get the cookies
cookie = r1.cookies.get_dict()

Build the form data from this information and log in

r2 = requests.post(
    'https://github.com/session',
    data={
        'commit': 'Sign in',
        'utf8': '✓',
        'authenticity_token': att,
        'login': username,
        'password': password
    },
    cookies=cookie
)
# The cookies returned after login can be used to access GitHub as the logged-in user
# (in the complete function this value is returned)
cookie = r2.cookies.get_dict()

Complete code:

def login(username, password):
    '''
    Log in to GitHub.
    Takes a username and password, logs in, and returns the cookies
    from the logged-in session.
    '''
    # GET the login page to obtain the authenticity_token and cookies (both needed for login)
    r1 = requests.get('https://github.com/login')

    if r1.status_code != 200:
        print('error! in login(): fail to request')
        return None

    soup = BeautifulSoup(r1.text, features='lxml')
    # Get the authenticity_token
    att = soup.find(name='input', attrs={
                        'name': 'authenticity_token'}).get('value')
    # Get the cookies
    cookie = r1.cookies.get_dict()
    # Log in to GitHub with this information
    r2 = requests.post(
        'https://github.com/session',
        data={
            'commit': 'Sign in',
            'utf8': '✓',
            'authenticity_token': att,
            'login': username,
            'password': password
        },
        cookies=cookie
    )

    if r2.status_code != 200:
        print('error! in login(): fail to request')
        return None

    print('login succeeded')
    return r2.cookies.get_dict()
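The function above carries cookies by hand from one request to the next. A requests.Session object does that bookkeeping automatically, so the same flow can be written against a session. The sketch below is one possible arrangement, not the author's method: `login_with_session` is a hypothetical name, the session is passed in so that anything with matching get() and post() methods will do, and the token is read with the standard library's html.parser. It has not been run against the live site, whose login form may have changed since this post was written.

```python
from html.parser import HTMLParser


class _TokenParser(HTMLParser):
    """Remember the value of the first input named authenticity_token."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if (tag == 'input' and self.token is None
                and attrs.get('name') == 'authenticity_token'):
            self.token = attrs.get('value')


def login_with_session(session, username, password):
    """Log in through `session` (assumed to behave like requests.Session).

    Returns the session on success, or None when a step fails.
    """
    page = session.get('https://github.com/login')
    if page.status_code != 200:
        return None

    parser = _TokenParser()
    parser.feed(page.text)
    if parser.token is None:
        return None  # the form layout is not what this sketch expects

    # The session sends back the cookies it received with the login page
    resp = session.post('https://github.com/session', data={
        'commit': 'Sign in',
        'utf8': '✓',
        'authenticity_token': parser.token,
        'login': username,
        'password': password,
    })
    return session if resp.status_code == 200 else None
```

It would be called as login_with_session(requests.Session(), username, password). Later requests made through the returned session reuse its cookies, so there is no need to pass cookies= on each call.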

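3. Starring the ten most-starred Python repositories with POST requests

We have found the ten repositories and logged in to GitHub successfully.

What remains is to find out how to star a given repository.

I opened a repository at random, 'thefuck', and found the Star button in the upper right corner.

Using the same approach as for login, I clicked the button, inspected the request in Chrome, and found the URL to POST to along with the data to submit.

The familiar authenticity_token again! A nuisance, but we already know how to find this value.

Sure enough, I found the hidden authenticity_token in the source of this page.

So we need to GET this page and pull the value out of its HTML.

Fetch the authenticity_token with a GET request, passing in the cookies obtained after login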

get2 = requests.get(
    'https://github.com/nvbn/thefuck',
    cookies=cookie
)
# Update the cookies
cookie = get2.cookies.get_dict()

# Get the authenticity_token
soup2 = BeautifulSoup(get2.text, features='lxml')
# On inspection, the token for starring sits inside the form whose
# class attribute is 'unstarred js-social-form'
soup = soup2.find('form', {'class': 'unstarred js-social-form'})
att2 = soup.find(name='input', attrs={
    'name': 'authenticity_token'}).get('value')

Submit the POST request to star the repository; once it goes through, the repository is starred

# Submit the POST request; once it is sent, the repository is starred
r = requests.post(
    'https://github.com/nvbn/thefuck/star',
    data={
        'utf8': '✓',
        'authenticity_token': att2,
        'context': 'repository'
    },
    cookies=cookie
)

Complete code:

def get_stared(cookie, name):
    '''
    Star a project using the cookies from the logged-in session.

    Inspecting a repository page shows that the Star button carries an
    'authenticity_token' and a link to POST to
    (https://github.com/owner/name/star). Fetching the token and POSTing to
    that link with the login cookies has the same effect as clicking the button.

    So the only arguments needed are the cookies and 'owner/name'.
    '''
    # GET the repository page to obtain the authenticity_token for the star POST
    get2 = requests.get(
        # Build the URL from the 'owner/name' passed in
        'https://github.com/%s' % name,
        cookies=cookie
    )
    cookie = get2.cookies.get_dict()
    # Check the request (the original checked an undefined `response` here)
    if get2.status_code != 200:
        print('error! in get_stared(): get: fail to request')
        return None

    # Get the authenticity_token
    soup2 = BeautifulSoup(get2.text, features='lxml')
    form = soup2.find('form', {'class': 'unstarred js-social-form'})
    if form is None:
        print('error! in get_stared(): star form not found')
        return None
    att2 = form.find(name='input', attrs={
        'name': 'authenticity_token'}).get('value')

    # Submit the POST request; once it is sent, the repository is starred
    r = requests.post(
        'https://github.com/%s/star' % name,
        data={
            'utf8': '✓',
            'authenticity_token': att2,
            'context': 'repository'
        },
        cookies=cookie
    )
    # Check the request. This case is unusual: in testing, the star went
    # through when the status code returned was 400
    if r.status_code != 400:
        print('error! in get_stared(): post: fail to request')
        return None
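Reading the token out of the form depends on the page markup, which can change at any time. GitHub's REST API also has a documented endpoint for starring, PUT /user/starred/{owner}/{repo}, which answers 204 No Content on success. This is a different technique from the one used in this post and is shown only as a sketch: it assumes you hold a personal access token that is allowed to star repositories, and `build_star_request` is a hypothetical helper name. The request is built but not sent.

```python
import urllib.request


def build_star_request(full_name, token):
    """Build (but do not send) the API request that stars 'owner/name'.

    `token` is assumed to be a GitHub personal access token.
    """
    return urllib.request.Request(
        'https://api.github.com/user/starred/%s' % full_name,
        method='PUT',
        headers={
            'Accept': 'application/vnd.github+json',
            'Authorization': 'Bearer %s' % token,
        },
    )


# 'YOUR_TOKEN' is a placeholder, not a working credential
req = build_star_request('nvbn/thefuck', 'YOUR_TOKEN')
print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen() would send it. No cookies or page scraping are involved, since the token in the Authorization header identifies the user.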

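Full program, including the imports and the entry point needed to run it:

'''
1. Log in to GitHub with a crawler
2. Search for the ten most popular Python repositories (ranked by star count)
3. Star them

Name: 周開顏
Student ID: 2017141461386

Tested; all features work.
'''

import requests
from bs4 import BeautifulSoup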

def login(username, password):
    '''
    Log in to GitHub.
    Takes a username and password, logs in, and returns the cookies
    from the logged-in session.
    '''
    # GET the login page to obtain the authenticity_token and cookies (both needed for login)
    r1 = requests.get('https://github.com/login')
    soup = BeautifulSoup(r1.text, features='lxml')

    # Get the authenticity_token
    att = soup.find(name='input', attrs={
        'name': 'authenticity_token'}).get('value')
    # Get the cookies
    cks = r1.cookies.get_dict()

    # Log in to GitHub with this information
    r2 = requests.post(
        'https://github.com/session',
        data={
            'commit': 'Sign in',
            'utf8': '✓',
            'authenticity_token': att,
            'login': username,
            'password': password
        },
        cookies=cks
    )
    print("login succeeded!")
    return r2.cookies.get_dict()


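def find_top10(language):
    '''
    Print and return the ten most popular repositories.
    Takes the language to search for (this program passes 'python') and
    returns a list of 'owner/name' strings for the top ten repositories.
    '''
    # Search through the API that GitHub provides
    response = requests.get(
        "https://api.github.com/search/repositories?q=language:%s&sort=stars" % language)

    # Check whether the request succeeded
    if response.status_code != 200:
        print('error! in find_top10(): fail to request')
        return None

    # The result is a dict parsed from the JSON response
    j = response.json()
    # The 'items' key holds the details of the top thirty repositories
    items = j['items']

    # Save the first ten entries (slicing is safe if fewer than ten come back)
    message = []
    for i, pro in enumerate(items[:10]):
        message.append(pro['full_name'])
        # Print each one in turn
        print('top%d:' % (i+1), pro['name'])
    return message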


def get_stared(cookie, name):
    '''
    Star a project using the cookies from the logged-in session.

    Inspecting a repository page shows that the Star button carries an
    'authenticity_token' and a link to POST to
    (https://github.com/owner/name/star). Fetching the token and POSTing to
    that link with the login cookies has the same effect as clicking the button.

    So the only arguments needed are the cookies and 'owner/name'.
    '''
    # GET the repository page to obtain the authenticity_token for the star POST
    get2 = requests.get(
        # Build the URL from the 'owner/name' passed in
        'https://github.com/%s' % name,
        cookies=cookie
    )
    cookie = get2.cookies.get_dict()
    # Check the request
    if get2.status_code != 200:
        print('error! in get_stared(): get: fail to request')
        return None

    # Get the authenticity_token
    soup2 = BeautifulSoup(get2.text, features='lxml')
    form = soup2.find('form', {'class': 'unstarred js-social-form'})
    if form is None:
        print('error! in get_stared(): star form not found')
        return None
    att2 = form.find(name='input', attrs={
        'name': 'authenticity_token'}).get('value')

    # Submit the POST request; once it is sent, the repository is starred
    r = requests.post(
        'https://github.com/%s/star' % name,
        data={
            'utf8': '✓',
            'authenticity_token': att2,
            'context': 'repository'
        },
        cookies=cookie
    )
    # Check the request. This case is unusual: in testing, the star went
    # through when the status code returned was 400
    if r.status_code != 400:
        print('error! in get_stared(): post: fail to request')
        return None
    print(name, 'starred')


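# Entry point
if __name__ == '__main__':
    # Log in
    username = input('please input your github id\n')
    password = input('please input your github password\n')
    coks = login(username, password)

    # Get the ten most popular Python repositories
    # (find_top10 returns None on failure, so fall back to an empty list)
    messages = find_top10('python') or []

    # Star each of the ten repositories
    for m in messages:
        if isinstance(m, str):
            get_stared(coks, m)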
