Python

Web Crawling (2): The BeautifulSoup Library

byeol_dev 2023. 5. 16. 09:00

๋„ค์ด๋ฒ„ ๋‰ด์Šค์˜ it/๊ณผํ•™ ํƒญ์˜ ํ—ค๋“œ๋ผ์ธ ๊ธฐ์‚ฌ ์ œ๋ชฉ ๋“ค๊ณ ์˜ค๊ธฐ

๋„ค์ด๋ฒ„ ๋‰ด์Šค ์ œ๋ชฉ ๊ฐ€์ ธ์˜ค๊ธฐ

# Import the required libraries.
# These imports are needed in every crawling script.
from bs4 import BeautifulSoup
from urllib.request import urlopen

#๋ฌธ์ž์—ด๋กœ ์ •๋ณด ๋“ค๊ณ ์˜ค๊ธฐ
html = urlopen('https://news.naver.com/main/main.naver?mode=LSD&mid=shm&sid1=105')

# Convert the fetched HTML into a BeautifulSoup object
# bs = the object that holds the parsed HTML
bs = BeautifulSoup(html.read(), 'html.parser')
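
To try the parsing step without hitting the network, here is a minimal sketch that feeds BeautifulSoup a small hand-written page. The sample_html bytes and the variable names are made up for illustration; they only stand in for whatever html.read() returns.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the bytes returned by html.read().
sample_html = b"<html><head><title>IT/Science</title></head><body><p>hello</p></body></html>"

# BeautifulSoup accepts bytes or str; 'html.parser' is the parser bundled with Python.
sample_bs = BeautifulSoup(sample_html, 'html.parser')

# Tags can be reached by attribute access or with find().
page_title = sample_bs.title.get_text()
first_paragraph = sample_bs.find('p').get_text()
print(page_title)
print(first_paragraph)
```

The same calls work on the real bs object once the page has been downloaded.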

To fetch every article title, think carefully about which tag to select.

We need the div that wraps the entire news article area.

# find returns a single Tag (find_all would return a list-like ResultSet, which has no .find method)
allDiv = bs.find('div', class_='list_body section_index')
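
The difference between find and find_all matters for the next step, because only a single Tag supports a further .find call. A small sketch on hypothetical markup (only the class names are taken from this post; the real Naver page is much larger and may differ):

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the class names come from the post.
sample_html = """
<div class="list_body section_index">
  <div class="_persist"><ul><li>first</li><li>second</li></ul></div>
</div>
"""
sample_bs = BeautifulSoup(sample_html, 'html.parser')

# find_all returns a list-like ResultSet; find returns one Tag (or None if nothing matches).
matches = sample_bs.find_all('div', class_='list_body section_index')
single = sample_bs.find('div', class_='list_body section_index')

match_count = len(matches)
inner = single.find('div', class_='_persist')            # works: single is a Tag
same_inner = matches[0].find('div', class_='_persist')   # a ResultSet must be indexed first
li_texts = [li.get_text() for li in inner.find_all('li')]
print(match_count, li_texts)
```

Calling .find directly on the ResultSet raises an AttributeError, so either use find or pick an element such as matches[0] first.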

ํ—ค๋“œ๋ผ์ธ ๋‰ด์Šค ์ •๋ณด๋งŒ ๊ฐ€์ ธ์˜ค๊ธฐ

li ํƒœ๊ทธ ํ•˜๋‚˜ ํ•˜๋‚˜๊ฐ€ ๋‰ด์Šค 1๊ฐœ์ž„.

#ํ—ค๋“œ๋ผ์ธ ๋‰ด์Šค ์ œ๋ชฉ ๊ฐ€์ ธ์˜ค๊ธฐ
headlineDiv = allDiv.find('div', class_ = '_persist') 
headlineDiv
#li ํ•˜๋‚˜๊ฐ€ ๋‰ด์Šค ํ•˜๋‚˜ํ•˜๋‚˜์ž„.
new_li_tags = headlineDiv.find('div').find('ul').find_all('li')
new_li_tags

Inspecting an li tag shows that the title sits in the a tag inside the div whose class is sh_text.

Extract the titles and store them in a list.

# Collect the headline titles here
titleList = []

for li_tag in new_li_tags:
    # The title text is in the a tag inside div.sh_text
    title = li_tag.find('div', class_="sh_text").find('a').get_text()
    print(title)
    titleList.append(title)
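
The chained find calls assume every li contains a div.sh_text with an a tag; if one does not, find returns None and the next call fails. A more defensive variant might look like the sketch below. The sample markup, the sh_thumb class, and the safe_titles name are hypothetical and only mirror the structure described in this post.

```python
from bs4 import BeautifulSoup

# Hypothetical markup mirroring the structure described above; the second li has no title.
sample_html = """
<div class="_persist"><div><ul>
  <li><div class="sh_text"><a href="#">First headline</a></div></li>
  <li><div class="sh_thumb"></div></li>
  <li><div class="sh_text"><a href="#">  Second headline </a></div></li>
</ul></div></div>
"""
sample_bs = BeautifulSoup(sample_html, 'html.parser')

safe_titles = []
for li_tag in sample_bs.find_all('li'):
    text_div = li_tag.find('div', class_='sh_text')
    # Only look for the link when the wrapping div exists.
    link = text_div.find('a') if text_div is not None else None
    if link is None:
        continue  # skip items that do not carry a title
    # strip=True trims surrounding whitespace from the title text.
    safe_titles.append(link.get_text(strip=True))
print(safe_titles)
```

Skipping incomplete items keeps the loop running even when the page layout contains entries without a title.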

 

๋‰ด์Šค ๋ชฉ๋ก์ด ์ž๋™์œผ๋กœ ๋ฐ”๋€Œ๊ธฐ ๋•Œ๋ฌธ์— ๋„ค์ด๋ฒ„์— ๋–  ์žˆ๋Š” ๊ธฐ์‚ฌ์™€ ์ถ”์ถœ ๊ฒฐ๊ณผ ์กฐ๊ธˆ ๋‹ค๋ฅผ ์ˆ˜ ์žˆ์Œ.