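I recently took on a job that required scraping paper information from a Japanese paper site and downloading every corresponding paper. A computer science student is obviously not going to download them by hand one at a time, so I picked up a bit of web scraping on the fly and wrote a small script by following existing examples.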

Designing the crawler

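Japan has no official paper site comparable to China's CNKI, so I chose J-stage as the target site for this crawler. Almost all papers on J-stage are free, which makes it a good paper search site within Japan.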

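The requirements: collect each paper's title, the authors' affiliated institution, the author names, the publication year and month, and the volume, issue and page numbers; write the results to an Excel spreadsheet; and download every paper that was scraped along the way.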
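One way to pin these requirements down before writing any scraping logic is to describe a single row of the spreadsheet as a small record type. The sketch below is only an illustration: `PaperRecord` and its methods are hypothetical names that do not appear in the script, and the field order simply mirrors the header row the script writes to Excel.

```python
from dataclasses import dataclass, astuple, fields


@dataclass
class PaperRecord:
    """One spreadsheet row. Hypothetical helper, not part of the original script."""
    author: str = "none"
    institution: str = "none"
    title: str = "none"
    keyword: str = "none"
    year: str = "none"
    vol: str = "none"
    page: str = "none"
    abstract: str = "none"

    @classmethod
    def header(cls):
        # Column names in declaration order, usable as the first row of the sheet
        return [f.name for f in fields(cls)]

    def to_row(self):
        # Values in the same order as header(), ready for ws.append(...)
        return list(astuple(self))
```

A record built this way always has the same number of columns as the header, even when a field could not be scraped and keeps its default value.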

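At first I wanted to use the Selenium library, because driving a browser with it looks very much like doing the work by hand. But Selenium code is not easy to write, and because Selenium is a web testing tool that has to run a real browser, its speed is also a problem. So I switched to the more common approach, the requests library.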

Code

This code is adapted from a Douban movie scraper found online, and some details borrow from the jstage-spider project on GitHub. My thanks to both.

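The overall structure is largely the same for every journal, and a different journal only requires changing the URL and the volume number, so the code below uses the journal コンピュータ＆エデュケーション as its example.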
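To make the "only change the URL and the volume number" point concrete, here is a minimal sketch of a URL generator parameterized by journal and volume range. The function name `issue_urls` and its parameters are my own for illustration. The URL pattern, including the fixed `/0/` segment and the `-char/ja` suffix, is copied from the script as is, and I have not checked it against other journals.

```python
BASE_URL = "https://www.jstage.jst.go.jp/browse/"


def issue_urls(journal, first_vol, last_vol):
    """Yield one listing-page URL per volume, from first_vol to last_vol inclusive.

    `journal` is the journal's identifier as it appears in the J-stage URL,
    for example "konpyutariyoukyouiku". Hypothetical helper, not part of the script.
    """
    for vol in range(first_vol, last_vol + 1):
        yield f"{BASE_URL}{journal}/{vol}/0/_contents/-char/ja"


if __name__ == "__main__":
    for url in issue_urls("konpyutariyoukyouiku", 47, 48):
        print(url)
```

With an inclusive range the caller states the first and last volume directly, which avoids having to remember that an upper bound is exclusive.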

'''
  function: scrape papers of the J-stage journal コンピュータ＆エデュケーション and write them to an Excel file
  env: python3.6.5
  author: lmx
'''
import time
import requests

from openpyxl import workbook  # used to write the Excel sheet
from bs4 import BeautifulSoup as bs
class Jstage:
    def __init__(self):
        # Start URL
        self.start_url = 'https://www.jstage.jst.go.jp/browse/konpyutariyoukyouiku/'  # this prefix stays fixed; only the trailing number changes
        # Request headers that mimic a browser
        self.headers = {
            'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
        }
        # Exclusive upper bound for the volume number appended to the URL
        self.page_num = 48

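    def get_page_url(self):
        '''Build the URL of each volume's listing page.'''
        n = 47  # first volume number to fetch; with page_num = 48 only volume 47 is fetched
        while n < self.page_num:
            yield self.start_url + str(n) + '/0/_contents/-char/ja'
            n += 1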

    def getHtml(self):
        '''Fetch the HTML source of each listing page.'''
        gu = self.get_page_url()  # URL generator
        for url in gu:
            html = requests.get(url, headers=self.headers).text
            yield html

    def getData(self):
        '''Extract the data of every paper.'''
        gh = self.getHtml()  # generator of HTML sources
        for html in gh:  # html: source of one listing page
            soup = bs(html, 'html.parser')
            #tmp = soup.find_all('ul', class_='search-resultslisting') # helper for locating the list
            for ul in soup.find_all('ul', class_='search-resultslisting'):
                for li in ul.find_all('li'):
                    # Title
                    title = li.find('div',class_='searchlist-title').text.strip()
                    # Link inside the title, which leads to the paper's detail page
                    title_url = li.find('div', class_='searchlist-title').find('a').get('href')
                    # Fetch and parse the detail page (kept in its own variable so the listing soup is not overwritten)
                    title_html = requests.get(title_url, headers=self.headers).text
                    detail_soup = bs(title_html, 'html.parser')
                    # Institution the author belongs to
                    try:
                        institution = detail_soup.find('ul', class_='accodion_body_ul').find('li').find('p').text.strip()
                    except AttributeError:
                        institution = "none"
                    # Keywords of the paper
                    try:
                        pre_keyword = detail_soup.find('div', class_='global-para').text.strip()
                        pre_keyword = pre_keyword.replace('\u2003', '').replace('\u3000', '').replace('\t', '').replace('\n', '').strip()
                    except AttributeError:
                        pre_keyword = "none:none"

                    # Split the keyword text and keep only the part after the first colon
                    try:
                        pre_keyword = pre_keyword.split(":",1)
                        pre, keyword = pre_keyword[0],pre_keyword[1]
                    except IndexError:
                        keyword = "none"

                    # From the listing page: author names, publication year, volume and issue, pages
                    try:
                        author = li.find('div',class_='searchlist-authortags customTooltip').text.strip()
                        info = li.find('div',class_='searchlist-additional-info').text.strip()
                        info = info.replace('\u2003', '').replace('\u3000', '').replace('\t', '').replace('\n', '').strip()
                        info = info.split("年",1)
                        year, rem = info[0],info[1]
                        rem = rem.split("p.",1)
                        vol, page = rem[0],rem[1]
                        page = page.split("発",1)
                        page = page[0]
                    except (AttributeError, IndexError):
                        print(title + ": could not be collected")
                        author = "no author"
                        year = "no year"
                        vol = "no vol"
                        page = "no page"

                    # Abstract
                    try:
                        abstract = li.find('div', class_='inner-content abstract').text.strip()
                        abstract = abstract.replace('\u2003', '').replace('\u3000', '').replace('\t', '').replace('\n', '').strip()
                    except AttributeError:
                        abstract = "none. 抄録全体を表示"

                    # Cut off the trailing "抄録全体を表示" (show full abstract) link text
                    abstract = abstract.split("抄録全体", 1)
                    abstract = abstract[0]

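                    # Download the paper's PDF
                    try:
                        # Looking up the link inside the try block keeps an entry without a PDF link from aborting the whole run
                        pre_file_url = li.find('div', class_='lft').find('span').find('a')
                        file_url = pre_file_url.get('href')
                        print(file_url)
                        r = requests.get(file_url, headers=self.headers, stream=True)
                        r.raise_for_status()  # do not save an error page as a PDF
                        with open("[" + vol + "]" + page + ".pdf", "wb") as pdf:
                            for chunk in r.iter_content(chunk_size=1024):
                                if chunk:
                                    pdf.write(chunk)
                        print("Collected: " + title)
                    except Exception as e:
                        print("------------------------------------------------Could not download: " + title)
                        print("Reason: %s"%e)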

                    # Be kind to the server
                    time.sleep(15)
                    yield [author, institution, title, keyword, year, vol, page, abstract]


    def saveToExcel(self, file_name):
        '''Save the data to an Excel file.
        :param file_name: name of the output file
        '''
        wb = workbook.Workbook()  # create the Excel workbook
        ws = wb.active  # get the active worksheet
        ws.append(['author', 'institution', 'title', 'keyword', 'year', 'vol', 'page', 'abstract'])
        gd = self.getData()  # data generator
        for data in gd:
            ws.append(data)
        wb.save(file_name)

if __name__ == '__main__':
    start = time.time()
    top = Jstage()
    try:
        top.saveToExcel('computer_edu.xlsx')
        print('Scraping succeeded, took %4.2f'%(time.time()-start)+' seconds')
    except Exception as e:
        print('Scraping failed, reason: %s'%e)
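The chain of `split` calls that pulls the year, volume and pages out of the listing's info text can also be written as a single regular expression, which makes the expected shape of the text explicit. The following is a sketch under assumptions: `parse_info` is a hypothetical helper, and the sample string in the usage lines is made up to match the `年` / `p.` / `発` markers the script splits on, not copied from J-stage. Unlike the script, it strips all whitespace, including ordinary spaces.

```python
import re

# year, then "年", then volume/issue text, then "p.", then pages,
# optionally followed by text starting with "発"
INFO_RE = re.compile(r"^(?P<year>.*?)年(?P<vol>.*?)p\.(?P<page>.*?)(?:発.*)?$", re.S)


def parse_info(raw):
    """Return (year, vol, page) or None if the text does not have the expected shape."""
    text = re.sub(r"\s+", "", raw)  # \s also matches \u2003 and \u3000 for str patterns
    m = INFO_RE.match(text)
    if m is None:
        return None
    return m.group("year"), m.group("vol"), m.group("page")


if __name__ == "__main__":
    sample = "2019\u2003年\u300047\u2003巻\np. 12-17\n発行日: 2019/12/01"  # made-up sample
    print(parse_info(sample))
```

Returning `None` for text that does not match lets the caller decide on fallback values in one place, instead of relying on an exception raised somewhere in the middle of the split chain.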

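One detail of the download step worth a closer look is the file name, which the script builds directly from the scraped `vol` and `page` strings. If those strings ever contain a character that is not allowed in file names, `open` fails, and every paper whose info could not be parsed ends up with the same name, `[no vol]no page.pdf`, so later downloads overwrite earlier ones. A minimal sketch of a sanitizing helper, assuming the same `[vol]page.pdf` naming scheme (`pdf_filename` is a hypothetical name, and the set of replaced characters follows the usual Windows restrictions):

```python
import re

# Characters that are not allowed in Windows file names, plus control characters
_INVALID = re.compile(r'[\\/:*?"<>|\x00-\x1f]')


def pdf_filename(vol, page):
    """Build "[vol]page.pdf", replacing characters that are unsafe in file names."""
    raw = f"[{vol}]{page}"
    cleaned = _INVALID.sub("_", raw).strip(" .")
    return (cleaned or "untitled") + ".pdf"


if __name__ == "__main__":
    print(pdf_filename("47巻", "12-17"))
```

This only addresses invalid characters. Avoiding the overwrite problem would additionally need something unique per paper in the name, such as a running counter.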
Small projects

#python #crawler

[Small project] A crawler for J-stage paper information

http://liuminxuan.github.io/2020/12/10/爬取J-stage论文信息的爬虫/

Published on

December 10, 2020
