Web Crawling: The BeautifulSoup Library

2023. 5. 11. 17:05ㆍPython

Fetching text from a page by web crawling means fetching the page's HTML code and pulling the text out of it.

Getting started with web crawling

Basic web crawling requires the BeautifulSoup library.
First install BeautifulSoup. This only needs to be done once per PC (more precisely, once per Python environment).

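# Install the library (run in a terminal, not inside Python)
# beautifulsoup4 is the actual package name; "pip install bs4" installs the same library through a wrapper package
pip install beautifulsoup4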

# Load BeautifulSoup
from bs4 import BeautifulSoup

from urllib.request import urlopen

Crawling the site below

http://pythonscraping.com/pages/page1.html

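# Open the site at this address and store the response in the html variable
# urlopen(): connects to the web page and returns a response object; calling .read() on it returns the page content as bytes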
html = urlopen('http://pythonscraping.com/pages/page1.html')

# Convert the fetched HTML into a BeautifulSoup object
# bs = the object that holds the parsed HTML
bs = BeautifulSoup(html.read(), 'html.parser')
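
The fetch-and-parse step above can fail if the URL is wrong or the server is unreachable. Here is a minimal sketch of the same two calls wrapped in a function with basic error handling. The function name `fetch_soup` and the 10-second timeout are assumptions made for illustration, not part of the original notes.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Return a BeautifulSoup object for url, or None if the request fails.

    fetch_soup and the default timeout are hypothetical choices for this sketch.
    """
    try:
        # urlopen() returns a response object; the with-block closes it afterwards
        with urlopen(url, timeout=timeout) as response:
            raw_html = response.read()  # page content as bytes
    except (HTTPError, URLError) as err:
        # HTTPError: the server answered with an error status (e.g. 404)
        # URLError: the address could not be reached at all
        print(f"Could not open {url}: {err}")
        return None
    return BeautifulSoup(raw_html, "html.parser")
```

Calling `fetch_soup('http://pythonscraping.com/pages/page1.html')` should then give the same kind of object as `bs` above, or `None` when the page cannot be opened.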

Getting a tag from the object

print(object.tag_name)

print(bs.title)

print(bs.h1)

print(bs.div)
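
To see what `object.tag_name` returns without going over the network, here is a small sketch that parses an inline string. The `sample_html` string is a made-up stand-in for a downloaded page, loosely modeled on the page above.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page
sample_html = """
<html>
  <head><title>A Useful Page</title></head>
  <body>
    <h1>An Interesting Title</h1>
    <div>First div</div>
    <div>Second div</div>
  </body>
</html>
"""
bs = BeautifulSoup(sample_html, "html.parser")

print(bs.title)  # <title>A Useful Page</title>
print(bs.h1)     # <h1>An Interesting Title</h1>
print(bs.div)    # <div>First div</div>  (only the first matching tag)
print(bs.table)  # None  (the tag does not exist in the document)
```

Note that the whole tag is returned, markup included, and that a missing tag gives `None` rather than an error.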

Another way to get a tag

object.find('tag_name')

# Find the title tag in the HTML
# object.find('tag_name')
bs.find('title')
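
Because `find()` returns `None` when nothing matches, it is usually worth checking the result before using it. A minimal sketch, assuming a made-up `sample_html` string in place of a real page:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page
sample_html = (
    "<html><head><title>A Useful Page</title></head>"
    "<body><h1>An Interesting Title</h1></body></html>"
)
bs = BeautifulSoup(sample_html, "html.parser")

title_tag = bs.find("title")
footer_tag = bs.find("footer")  # there is no <footer> in sample_html

for name, tag in (("title", title_tag), ("footer", footer_tag)):
    if tag is None:
        print(f"<{name}> not found")
    else:
        print(f"<{name}> found: {tag}")
```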

Crawling the site below

https://www.pythonscraping.com/pages/warandpeace.html

The code below is needed in every crawling script.

# Import the required libraries.
# These imports are needed every time
from bs4 import BeautifulSoup
from urllib.request import urlopen

Fetching the site's content

# Fetch the page content
html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')

# Convert the fetched HTML into a BeautifulSoup object
# bs = the object that holds the parsed HTML
bs = BeautifulSoup(html.read(), 'html.parser')

Finding a tag and getting its content

- Getting the span tag

# If several matching tags exist, find() returns the first one
print(bs.find('span'))

Finding all matching tags

object.find_all('tag_name')

# Find every occurrence of the tag
# The printed result is wrapped in []
# All span tags are returned as a list
bs.find_all('span')
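
The list-like result of `find_all()` can be counted and indexed like an ordinary list. A short sketch on an inline string; `sample_html` is made up for illustration and only imitates the colored spans of the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page
sample_html = """
<p>
  <span class="red">Well, Prince</span>
  <span class="green">Anna Pavlovna</span>
  <span class="green">the prince</span>
</p>
"""
bs = BeautifulSoup(sample_html, "html.parser")

all_spans = bs.find_all("span")
print(len(all_spans))        # 3
print(all_spans[0])          # <span class="red">Well, Prince</span>
print(bs.find_all("table"))  # []  (no match gives an empty list, not None)
```

So `find('span')` gives the same tag as `find_all('span')[0]`, but the two differ when nothing matches: `find()` returns `None`, `find_all()` returns an empty list.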

Finding tags by class attribute

# Select every span tag whose class attribute is green
bs.find_all('span', class_='green')
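
The keyword is spelled `class_` because `class` is a reserved word in Python. As a rough sketch, the same filter can also be written with an `attrs` dictionary or a CSS selector. The `sample_html` string is again a made-up stand-in, and the third span carries two classes to show that matching on one of them is enough.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page
sample_html = """
<p>
  <span class="red">Well, Prince</span>
  <span class="green">Anna Pavlovna</span>
  <span class="green bold">the prince</span>
</p>
"""
bs = BeautifulSoup(sample_html, "html.parser")

by_keyword = bs.find_all("span", class_="green")           # keyword argument
by_attrs = bs.find_all("span", attrs={"class": "green"})   # attrs dictionary
by_selector = bs.select("span.green")                      # CSS selector

# All three approaches pick the same two tags here
for tag in by_keyword:
    print(tag)
```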

Searching this way returns each matching tag in full, including its markup.

Getting only the text from the returned tags

Store the returned tags in a list variable

spanTagList = bs.find_all('span', class_='green')

Getting only a tag's text

tag.get_text()

Loop over the list with a for statement and call get_text() on each span tag.

# Get only the text inside each span tag
for spanTag in spanTagList:
    print(spanTag.get_text())
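
If the text is needed for later processing rather than just printing, the same loop can collect it into a list. A minimal sketch on a made-up `sample_html` string; `strip=True` is an optional `get_text()` argument that trims surrounding whitespace.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page
sample_html = """
<p>
  <span class="red">Well, Prince</span>
  <span class="green">
    Anna Pavlovna
  </span>
  <span class="green">the prince</span>
</p>
"""
bs = BeautifulSoup(sample_html, "html.parser")

spanTagList = bs.find_all("span", class_="green")

# Collect the text of every green span, without the surrounding whitespace
green_texts = [spanTag.get_text(strip=True) for spanTag in spanTagList]
print(green_texts)  # ['Anna Pavlovna', 'the prince']
```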
