'''
function: scrape article metadata from the J-STAGE journal
          "コンピュータ&エデュケーション" and write it to an Excel file
env:      python3.6.5
author:   lmx
'''
import time

import requests
from bs4 import BeautifulSoup as bs
from openpyxl import Workbook


class Jstage:
    def __init__(self):
        self.start_url = 'https://www.jstage.jst.go.jp/browse/konpyutariyoukyouiku/'
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                          'AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/78.0.3904.108 Safari/537.36',
        }
        # Volumes [47, page_num) are crawled; with page_num = 48 that is
        # volume 47 only.
        self.page_num = 48
    def get_page_url(self):
        '''Build the table-of-contents URL for each volume.'''
        n = 47
        while n < self.page_num:
            yield self.start_url + str(n) + '/0/_contents/-char/ja'
            n += 1
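    # Sketch of what the generator produces (with the defaults above,
    # page_num = 48, it yields a single URL for volume 47):
    #
    #   >>> list(Jstage().get_page_url())
    #   ['https://www.jstage.jst.go.jp/browse/konpyutariyoukyouiku/47/0/_contents/-char/ja']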
    def getHtml(self):
        '''Fetch the HTML source of each table-of-contents page.'''
        gu = self.get_page_url()
        for url in gu:
            html = requests.get(url, headers=self.headers).text
            yield html
    def getData(self):
        '''Extract the metadata of every article listed on each page.'''
        gh = self.getHtml()
        for html in gh:
            soup = bs(html, 'html.parser')
            for ul in soup.find_all('ul', class_='search-resultslisting'):
                for li in ul.find_all('li'):
                    title = li.find('div', class_='searchlist-title').text.strip()
                    title_url = li.find('div', class_='searchlist-title').find('a').get('href')
                    # Fetch the article's own page; use a separate soup so the
                    # listing-page soup is not shadowed.
                    title_html = requests.get(title_url, headers=self.headers).text
                    title_soup = bs(title_html, 'html.parser')
                    try:
                        institution = title_soup.find('ul', class_='accodion_body_ul').find('li').find('p').text.strip()
                    except Exception:
                        institution = "none"
                    try:
                        pre_keyword = title_soup.find('div', class_='global-para').text.strip()
                        pre_keyword = pre_keyword.replace('\u2003', '').replace('\u3000', '').replace('\t', '').replace('\n', '').strip()
                    except Exception:
                        pre_keyword = "none:none"
                    # The keyword block is expected to look like "label:keywords";
                    # keep only the part after the first colon.
                    try:
                        pre_keyword = pre_keyword.split(":", 1)
                        pre, keyword = pre_keyword[0], pre_keyword[1]
                    except Exception:
                        keyword = "none"
                    try:
                        author = li.find('div', class_='searchlist-authortags customTooltip').text.strip()
                        info = li.find('div', class_='searchlist-additional-info').text.strip()
                        info = info.replace('\u2003', '').replace('\u3000', '').replace('\t', '').replace('\n', '').strip()
                        # info looks like "<year>年<vol>p.<pages>発行日...";
                        # peel off year, volume and pages step by step.
                        info = info.split("年", 1)
                        year, rem = info[0], info[1]
                        rem = rem.split("p.", 1)
                        vol, page = rem[0], rem[1]
                        page = page.split("発", 1)
                        page = page[0]
                    except Exception:
                        print(title + ": could not be collected")
                        author = "no author"
                        year = "no year"
                        vol = "no vol"
                        page = "no page"
                    try:
                        abstract = li.find('div', class_='inner-content abstract').text.strip()
                        abstract = abstract.replace('\u2003', '').replace('\u3000', '').replace('\t', '').replace('\n', '').strip()
                    except Exception:
                        # Keep the "抄録全体" marker so the split below still works.
                        abstract = "none. 抄録全体を表示"
abstract = abstract.split("抄録全体", 1) abstract = abstract[0]
                    # Locate the PDF download link for this article.
                    pre_file_url = li.find('div', class_='lft').find('span').find('a')
                    file_url = pre_file_url.get('href')
                    print(file_url)
                    try:
                        r = requests.get(file_url, headers=self.headers, stream=True)
                        with open("[" + vol + "]" + page + ".pdf", "wb") as pdf:
                            for chunk in r.iter_content(chunk_size=1024):
                                if chunk:
                                    pdf.write(chunk)
                        print("Collected: " + title)
                    except Exception as e:
                        print("---------------- download failed: " + title)
                        print("reason: %s" % e)
                    time.sleep(15)    # be polite to the server between articles
                    yield [author, institution, title, keyword, year, vol, page, abstract]
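    # Usage sketch: getData() is a generator, so rows can be consumed one at a
    # time (each step also downloads that article's PDF as a side effect):
    #
    #   >>> row = next(Jstage().getData())
    #   >>> row    # [author, institution, title, keyword, year, vol, page, abstract]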
    def saveToExcel(self, file_name):
        '''Write all collected rows to an Excel file.

        :param file_name: name of the output file
        '''
        wb = Workbook()
        ws = wb.active
        ws.append(['author', 'institution', 'title', 'keyword', 'year', 'vol', 'page', 'abstract'])
        gd = self.getData()
        for data in gd:
            ws.append(data)
        wb.save(file_name)
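    # Verification sketch (hedged; 'computer_edu.xlsx' is the file written by
    # the main block below): the workbook can be read back with openpyxl to
    # spot-check the result.
    #
    #   >>> from openpyxl import load_workbook
    #   >>> ws = load_workbook('computer_edu.xlsx').active
    #   >>> ws.max_row    # 1 header row + one row per scraped article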
if __name__ == '__main__':
    start = time.time()
    top = Jstage()
    try:
        top.saveToExcel('computer_edu.xlsx')
        print('Scrape succeeded, took %4.2f seconds' % (time.time() - start))
    except Exception as e:
        print('Scrape failed, reason: %s' % e)
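# Optional hardening (a sketch, not part of the scraper above): reusing one
# requests.Session with automatic retries would tolerate transient network
# errors; the session would then replace the bare requests.get() calls.
#
#   from requests.adapters import HTTPAdapter
#   from urllib3.util.retry import Retry
#
#   session = requests.Session()
#   session.headers.update(Jstage().headers)
#   retry = Retry(total=3, backoff_factor=1,
#                 status_forcelist=[429, 500, 502, 503, 504])
#   session.mount('https://', HTTPAdapter(max_retries=retry))
#   html = session.get(url, timeout=30).text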