Python Web Scraping Basics (4): Scraping the AJAX-Loaded Lagou Site (IP Proxies, Random Headers)

Keywords: web scraping, Lagou (拉勾网)

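In this part we look at how to scrape Lagou, a site whose listings are loaded via AJAX using the POST method.

First, a quick look at how POST works in requests: passing parameters through data.

Typically you want to send form-encoded data, much like an HTML form. To do this, simply pass a dictionary to the data argument; the dictionary is form-encoded automatically when the request is made.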

>>> payload = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.post("http://httpbin.org/post", data=payload)
>>> print(r.text)
{
  ...
  "form": {
    "key2": "value2",
    "key1": "value1"
  },
  ...
}
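
To make "form-encoded" concrete, here is a minimal sketch that uses only the standard library. It assumes that, for a flat dictionary of strings like the illustrative payload above, the body requests sends matches what urllib.parse.urlencode produces.

```python
from urllib.parse import urlencode, parse_qs

payload = {'key1': 'value1', 'key2': 'value2'}

# Form encoding joins key=value pairs with '&' and percent-escapes special characters.
body = urlencode(payload)
print(body)

# Decoding the body recovers the fields, each as a list of values.
decoded = parse_qs(body)
print(decoded)
```

This is the same shape of data that httpbin echoes back under "form" in the response above.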

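Let's first take a look at the Lagou page.

Its source looks like this:

From the source we can work out roughly which data to parse. Here is the approach I take to scraping it.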

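Disguising the client as a browser

To pose as a browser we use the fake_useragent library; for its other features, see the library's documentation.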

from fake_useragent import UserAgent
userAgentInstance = UserAgent()  # create a UserAgent instance
userAgent = userAgentInstance.random  # the random property supplies a randomly chosen User-Agent string

Then put it into the headers:

headers = {
    'User-Agent' : userAgent,
}
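
If fake_useragent is not available, the same idea can be sketched with a hand-maintained list. The strings in USER_AGENTS and the helper name random_headers are illustrative assumptions, not part of the original code.

```python
import random

# Illustrative User-Agent strings (assumed values; replace with ones you trust).
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 '
    '(KHTML, like Gecko) Version/17.0 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0',
]

def random_headers():
    """Build a fresh headers dict with a randomly chosen User-Agent."""
    return {'User-Agent': random.choice(USER_AGENTS)}

headers = random_headers()
```

Calling random_headers() before each request, rather than once at import time, gives every request its own chance of a different User-Agent.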

A simple IP pool

First find some free proxy IPs, then use them to build a simple IP pool.

proxiesList = [  # a handful of working proxies is enough unless you are scraping a large volume of data
    'http://120.32.209.231:8118',
    'http://210.82.36.142:80',
    'http://123.125.116.151:80',
    'http://113.5.80.144:8080',
    'http://122.224.227.202:3128',
    'http://220.191.214.176:8080'
]

Build the proxies dictionary:

import random
proxies = {'http': random.choice(proxiesList)}

Pass it into the request:

responses = requests.get(url, headers=headers, proxies=proxies)

See the requests documentation for the full list of parameters it accepts.
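
One detail worth noting: requests selects a proxy by the URL's scheme, so a dictionary with only an 'http' key is not applied to https:// URLs such as the Lagou home page. The sketch below picks a new proxy on every call and registers it for both schemes. pickProxies and samplePool are hypothetical names, and whether a given free proxy can actually carry HTTPS traffic is an assumption you would need to verify.

```python
import random

def pickProxies(pool):
    """Return a requests-style proxies dict built from one randomly chosen proxy."""
    proxy = random.choice(pool)
    # Register the same proxy for both URL schemes.
    return {'http': proxy, 'https': proxy}

# Two entries borrowed from the list above, purely for illustration.
samplePool = ['http://120.32.209.231:8118', 'http://210.82.36.142:80']
proxies = pickProxies(samplePool)
```

Calling pickProxies(proxiesList) right before each request spreads the traffic across the pool instead of fixing one proxy for the whole run.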

Parsing the URL of every job category on the home page

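import requests
from lxml import etree
from pymongo import MongoClient

def getAllUrl():
    # home page URL
    mainPageUrl = 'https://www.lagou.com/'
    responses = requests.get(mainPageUrl, headers=headers).text
    # parse the home page into an Element object
    lagouHtml = etree.HTML(responses)
    urlNames = lagouHtml.xpath('//div[@class="sidebar"]/div/div/div[2]/dl/dd/a/text()')
    urls = lagouHtml.xpath('//div[@class="sidebar"]/div/div/div[2]/dl/dd/a/@href')
    # pair each job category with its URL in a dict, for storage and later use
    for urlName, url in zip(urlNames, urls):
        item = {
            'urlName': urlName,
            'url': url
        }
        # because of this yield, calling getAllUrl() returns a generator
        # instead of running the body right away
        yield item

def saveSubUrl(item):
    client = MongoClient()
    db = client.拉勾主站所有网址  # database: "all URLs on the Lagou main site"
    sheet = db.网址明细  # collection: "URL details"
    sheet.insert_one(item)

if __name__ == '__main__':
    # the generator must be iterated; calling getAllUrl() alone fetches nothing
    for item in getAllUrl():
        saveSubUrl(item)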
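
The generator behaviour matters here: calling a function that contains yield runs none of its body until something iterates over the result. A small self-contained sketch with made-up data (no network access; pairs, calls, and the example.com URLs are illustrative names only):

```python
calls = []

def pairs(names, urls):
    calls.append('started')          # runs only once iteration begins
    for name, url in zip(names, urls):
        yield {'urlName': name, 'url': url}

gen = pairs(['Java', 'Python'],
            ['https://example.com/java/', 'https://example.com/python/'])
started_before = list(calls)         # snapshot taken before any iteration
items = list(gen)                    # iterating drives the generator to completion
```

Here started_before stays empty because the body has not run at that point; only list(gen) causes the dictionaries to be produced.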

Using these categories and their URLs to find the listing pages and scrape the data we want

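import time

def urlResp():
    # loop over the category dicts yielded by getAllUrl()
    for item in getAllUrl():
        # initial status code and page counter i
        statusCode = 200
        i = 1
        # keep paging through this category while the built URL returns status 200 (valid data);
        # once a page returns an error, the status code becomes 4XX or similar
        while statusCode == 200:
            # build the URL for page i
            url = item['url'] + str(i)
            # try to fetch the page; stop paging this category if the request fails,
            # since retrying forever through the same proxy would never terminate
            try:
                responses = requests.get(url, headers=headers, proxies=proxies)
            except requests.RequestException:
                break
            statusCode = responses.status_code
            # a non-200 status means there are no more pages, so leave the while loop
            if statusCode != 200:
                break
            resp = etree.HTML(responses.text)
            i = i + 1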
            # pymongo rejects collection names that start with '.', so work around '.NET'
            if item['urlName'] == '.NET':
                name = 'NET'
            else:
                name = item['urlName']
            # yield responseHtml
            companies = resp.xpath('//*[@id="s_position_list"]/ul/li/div[1]/div[2]/div[1]/a/text()')
            positions = resp.xpath('//*[@id="s_position_list"]/ul/li/div[1]/div[1]/div[1]/a/h3/text()')
            places = resp.xpath('//*[@id="s_position_list"]/ul/li/div[1]/div[1]/div[1]/a/span/em/text()')
            pays = resp.xpath('//*[@id="s_position_list"]/ul/li/div[1]/div[1]/div[2]/div/span/text()')
            works = resp.xpath('//*[@id="s_position_list"]/ul/li/div[1]/div[2]/div[2]/text()')
            descriptions = resp.xpath('//*[@id="s_position_list"]/ul/li/div[2]/div[2]/text()')
            needs = resp.xpath('//*[@id="s_position_list"]/ul/li/div[2]/div[1]/span/text()')
            for company,position,place,pay,work,description,need in zip(companies,positions,places,pays,works,descriptions,needs):
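                record = {
                    '公司' : company.strip(),          # company
                    '职位' : position.strip(),         # position
                    '工作地点' : place.strip(),        # location
                    '薪资' : pay.strip(),              # salary
                    '工作职务' : work.strip(),         # job duties
                    '公司描述' : description.strip(),  # company description
                    '能力需要' : need.strip()          # required skills
                }
                storeData(record, name)
            # pause one second after each page as a further anti-blocking measure
            time.sleep(1)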
# store one record in the database
def storeData(data, DbSheet):
    client = MongoClient()
    db = client['拉勾网']  # database named "Lagou"
    mySet = db[DbSheet]
    mySet.insert_one(data)
    print("Saved successfully")
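
The paging logic can be exercised without touching the network by injecting the fetch step. The following is only a sketch under assumptions: crawlPages, fakeFetch, and alwaysFail are hypothetical names, fetch is any callable that returns a (status_code, text) pair or raises an exception, and the retry limit of 3 is arbitrary.

```python
def crawlPages(fetch, baseUrl, maxRetries=3):
    """Yield page texts for baseUrl + '1', baseUrl + '2', ... until a non-200 status."""
    page = 1
    while True:
        url = baseUrl + str(page)
        for attempt in range(maxRetries):
            try:
                status, text = fetch(url)
                break
            except Exception:
                continue            # retry the same page, up to maxRetries times
        else:
            return                  # every attempt failed: give up on this category
        if status != 200:
            return                  # no more pages
        yield text
        page += 1

def fakeFetch(url):
    # Stand-in for requests.get: pages 1 and 2 exist, anything else is a 404.
    if url.endswith(('/1', '/2')):
        return 200, 'page ' + url[-1]
    return 404, ''

def alwaysFail(url):
    # Stand-in for a dead proxy: every request raises.
    raise IOError('connection failed')

pages = list(crawlPages(fakeFetch, 'https://example.com/jobs/'))
failed = list(crawlPages(alwaysFail, 'https://example.com/jobs/'))
```

Bounding the retries guarantees the loop terminates even when a proxy is dead, and passing fetch in as a parameter lets the real requests.get call be swapped in later.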