Notes on a Strange Web-Scraping Experience

Preface

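For various reasons, I needed to scrape information about tourist attractions in China.

I went to the website of the China National Tourism Administration and found a directory of 5A-rated scenic areas.

URL: http://www.cnta.gov.cn:8000/Forms/TravelCatalog/TravelCatalogList.aspx?catalogType=view&resultType=5A

So I created a new project named 5stat on the pyspider demo page and started crawling.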

Analyzing the page

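The page is somewhat unusual. It appears to be written in .NET, and pagination is implemented by buttons that call JS code. After moving to another page, the URL stays the same.

This is where pyspider's ability to run a JS script after the page has loaded comes in.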
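
As a rough sketch of what that feature looks like from the crawler side, the helper below builds keyword arguments that could be passed to pyspider's self.crawl. The helper name js_crawl_kwargs and the sample script are hypothetical; fetch_type, js_script, method, and data are the self.crawl options as I understand them from the pyspider documentation, so check them against the version you run.

```python
def js_crawl_kwargs(js_script, form_data=None):
    """Build keyword arguments for pyspider's self.crawl (hypothetical helper)."""
    kwargs = {
        "fetch_type": "js",      # ask for the JS-capable fetcher
        "js_script": js_script,  # source of a JS function to run in the page
    }
    if form_data is not None:
        # Submit the form fields as a POST request instead of a plain GET.
        kwargs["method"] = "POST"
        kwargs["data"] = form_data
    return kwargs


# Inside a handler this would be used roughly as:
#     self.crawl(url, callback=self.index_page, **js_crawl_kwargs(script))
# with the script's return value then read from response.js_script_result.
example = js_crawl_kwargs("function() { return document.title; }")
print(example)
```

Keeping the arguments in a plain dict makes it easy to reuse the same script for the first GET and for the later POST requests.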

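First, let's look at what the pagination button actually does.

As the figure below shows, it calls a function named __doPostBack.

[Figure: __doPostBack]

Searching the page for this function gives the following definition: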

var theForm = document.forms['form1'];
if (!theForm) {
    theForm = document.form1;
}
function __doPostBack(eventTarget, eventArgument) {
    if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
        theForm.__EVENTTARGET.value = eventTarget;
        theForm.__EVENTARGUMENT.value = eventArgument;
        theForm.submit();
    }
}

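The function sets __EVENTTARGET in theForm to the event target (PageNavigator1$LnkBtnNext for the next-page button), sets __EVENTARGUMENT to the event argument, and then submits the form.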
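
To make this concrete, here is a minimal Python sketch that extracts the two values __doPostBack would write into the form. It assumes the button's link has the usual ASP.NET shape javascript:__doPostBack('target','argument'); the exact href on this page is an assumption, and postback_fields is a hypothetical helper name.

```python
import re

# Assumed link format: javascript:__doPostBack('target','argument')
_POSTBACK_RE = re.compile(r"__doPostBack\('([^']*)','([^']*)'\)")


def postback_fields(href):
    """Return the hidden-field values that __doPostBack would set."""
    match = _POSTBACK_RE.search(href)
    if match is None:
        raise ValueError("not a __doPostBack link: %r" % href)
    target, argument = match.groups()
    return {"__EVENTTARGET": target, "__EVENTARGUMENT": argument}


fields = postback_fields("javascript:__doPostBack('PageNavigator1$LnkBtnNext','')")
print(fields)
```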

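Looking at the element that theForm refers to, there are three hidden fields: __EVENTTARGET, __EVENTARGUMENT, and __VIEWSTATE.

[Figure: theForm]

Nearby there is one more hidden field, __EVENTVALIDATION. Judging by the name, it probably has to be submitted as well.

So I tried submitting only three of these values to see whether the server would report an error.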

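Install the Postman app in Chrome and open it.

[Figure: Postman]

Change the method to POST, fill in the address and the values of the three fields, and click Send.

[Figure: Postman result]

OK, the correct page came back, so the approach works.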
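
The same experiment can be sketched in Python with the standard library. This is only an illustration: it assumes the three submitted fields are __EVENTTARGET, __VIEWSTATE, and __EVENTVALIDATION, and the field values below are placeholders rather than real data from the page. The helper name build_next_page_request is hypothetical.

```python
from urllib.parse import parse_qs, urlencode
from urllib.request import Request

LIST_URL = ("http://www.cnta.gov.cn:8000/Forms/TravelCatalog/"
            "TravelCatalogList.aspx?catalogType=view&resultType=5A")


def build_next_page_request(viewstate, eventvalidation):
    """Build (but do not send) the POST request for the next page."""
    form = {
        # Assumption: these are the three fields that were submitted.
        "__EVENTTARGET": "PageNavigator1$LnkBtnNext",
        "__VIEWSTATE": viewstate,
        "__EVENTVALIDATION": eventvalidation,
    }
    body = urlencode(form).encode("ascii")  # percent-encodes $, +, / and =
    return Request(LIST_URL, data=body, method="POST")


# Placeholder values; real ones come from the hidden fields of the current page.
req = build_next_page_request("VIEWSTATE_PLACEHOLDER", "VALIDATION_PLACEHOLDER")
print(req.get_method(), req.full_url)
print(parse_qs(req.data.decode("ascii")))
```

Passing req to urllib.request.urlopen would actually send it; that step is left out here so the sketch runs without network access.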

The crawler script

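I won't go into how to write a pyspider crawler script here; if you are unfamiliar with it, see the documentation.

The part worth showing is the JS script that the crawler runs on the page: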

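function() {
    // Report whether a next page exists: 'y' unless the next-page button is disabled.
    var flag = 'y';
    var next = document.querySelector('#PageNavigator1_LnkBtnNext');
    // A missing button or any `disabled` attribute (even an empty one) means no next page.
    if (!next || next.hasAttribute('disabled')) {
        flag = 'n';
    }
    // Return both hidden-field values and the flag, joined with '~', for the crawler to split.
    return document.form1.__VIEWSTATE.value + '~' + document.form1.__EVENTVALIDATION.value + '~' + flag;
}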
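
On the Python side, the string returned by this script has to be split back into its parts. A minimal sketch, assuming neither hidden value contains '~' (both are normally Base64-encoded, and '~' is not in the Base64 alphabet); parse_js_result is a hypothetical helper name:

```python
def parse_js_result(result):
    """Split 'VIEWSTATE~EVENTVALIDATION~flag' into its three parts.

    Raises ValueError if the string does not have exactly three parts.
    """
    viewstate, eventvalidation, flag = result.split("~")
    has_next = (flag == "y")  # 'y' means the next-page button is enabled
    return viewstate, eventvalidation, has_next


# Made-up sample input; real values come from the page's hidden fields.
viewstate, eventvalidation, has_next = parse_js_result("abc+/=~def==~y")
print(has_next)
```

When has_next is true, the two values can be posted back together with __EVENTTARGET to request the next page; when it is false, the crawl of the list is finished.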