
Finally reached reputation level 2, so I'm posting a QA scraper script written in Python to see if there are any Python enthusiasts here

I'm self-taught rather than a professional, so the code may be non-standard and full of mistakes; I'm posting the source to see whether any pros or hobbyists can offer suggestions. The main goal is to find out what customers care about, using word frequency to get a macro view of their top concerns.
For now it only scrapes the content; to generate a word cloud you still have to copy the text into a word-frequency website by hand (a sketch for doing this locally follows the script below). The ANSWER file also picks up some unwanted material, but I didn't bother fixing that, since I think the QUESTION file is what matters.
import requests, threading, time
from bs4 import BeautifulSoup
from collections import Counter
from queue import Queue, Empty
import sys, os

def get_session():
    return requests.Session()

# Fetch the response for a URL
def fetch(session, url):
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
        'referer': 'https://www.amazon.com/'  # was 'refer', which is not a real header name
    }
    resp = session.get(url, headers=headers)
    return resp

# Get the total page count from the pagination bar (asin is now passed in instead of read from a global)
def get_pages(session, url, asin):
    url = url.format(asin, 2)
    resp = fetch(session, url)
    soup = BeautifulSoup(resp.text, 'lxml')
    # the next-to-last pagination item holds the last page number
    pages = soup.select('.a-pagination > li:nth-last-child(2) > a')[0].text
    return pages

# Scrape the QA list from each page URL in the queue
write_lock = threading.Lock()  # keep the question/answer files line-aligned across threads

def get_qa(session, urlList):
    while True:
        try:
            # Queue._qsize() is private and racy between threads; get_nowait() is the safe pattern
            url = urlList.get_nowait()
        except Empty:
            break
        resp = fetch(session, url)
        soup = BeautifulSoup(resp.text, 'lxml')
        group_qa = soup.select('.askTeaserQuestions > .a-fixed-left-grid.a-spacing-base > .a-fixed-left-grid-inner > .a-fixed-left-grid-col.a-col-right')
        for qa in group_qa:
            question = qa.select('.a-fixed-left-grid-col.a-col-right > a > span')[0].text.strip()
            answer = qa.select('.a-fixed-left-grid.a-spacing-base .a-fixed-left-grid-col.a-col-right > span')[0].text
            # f.write(question + '\t' + answer + '\n')
            with write_lock:
                fq.write(question + '\n')
                fa.write(answer + '\n')

def main(url, asin):
    session = get_session()
    # total number of QA pages
    pages = int(get_pages(session, url, asin))
    urlList = Queue()  # queue up the page URLs to scrape
    for i in range(1, pages + 1):
        furl = url.format(asin, i)
        urlList.put(furl)
    # scrape the QA pages with a pool of worker threads
    thread_list = []
    thread_count = 15
    for _ in range(thread_count):
        t = threading.Thread(target=get_qa, args=(session, urlList))
        t.start()
        thread_list.append(t)
    for t in thread_list:
        t.join()

# Deprecated: the idea was to strip common articles and pronouns before counting
# word frequency, but the original str.replace() approach mangles words (replacing
# 'a' strips that letter from every word), so tokens are filtered against a stopword set instead.
def get_most_count(TEXT):
    for char in '\n\t.?-':
        TEXT = TEXT.replace(char, ' ')
    stopwords = {'the', 'i', 'to', 'you', 'and', 'a', 'these', 'it', 'they',
                 'with', 'have', 'can', 'be', 'at', 'of', 'are', 'them'}
    word_list = [w for w in TEXT.split() if w.lower() not in stopwords]
    print(Counter(word_list).most_common())

if __name__ == "__main__":
    os.chdir(sys.path[0])
    ### configuration ###
    # set the target ASIN here
    asin = 'B07DPJVN6P'
    us = 'https://www.amazon.com/'
    uk = 'https://www.amazon.co.uk/'
    # set the marketplace here
    Marketplace = uk
    baseurl = Marketplace + 'ask/questions/asin/{}/{}/ref=ask_dp_iaw_ql_hza?isAnswered=true'
    fq = open('./questions.txt', 'w', encoding='utf-8')  # output file for questions
    fa = open('./answers.txt', 'w', encoding='utf-8')  # output file for answers
    start = time.time()
    # run the scraper
    main(baseurl, asin)
    fq.close()
    fa.close()
    print('duration: %.2f' % (time.time() - start))
    # count word frequency
    # f = open('a.txt', 'r', encoding='utf-8')
    # get_most_count(f.read())
    # f.close()
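
Since the post mentions that generating the word cloud still means pasting questions.txt into an online word-frequency tool, here is a minimal sketch of doing it locally. It assumes the third-party wordcloud package (pip install wordcloud); the output file name is just illustrative.

# Minimal sketch: build the word cloud locally from questions.txt,
# assuming the third-party 'wordcloud' package is installed (pip install wordcloud)
from wordcloud import WordCloud, STOPWORDS

with open('./questions.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# STOPWORDS already covers common articles and pronouns, so no manual replace list is needed
wc = WordCloud(width=800, height=400, background_color='white',
               stopwords=STOPWORDS).generate(text)
wc.to_file('./questions_cloud.png')  # the biggest words are the customers' top concerns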
Not that useful, honestly. I once wrote a Python scraper to crawl the products of various Amazon stores and brands and flag new listings, and in the end it was no better than just checking them by hand.