์›น๋ฐ์ดํ„ฐ ํฌ๋กค๋งํ•˜์—ฌ csv ์ƒ์„ฑํ•˜๊ธฐ

2023. 6. 7. 09:58 · Python

https://code.visualstudio.com/

 


Go to the site above and install it.

<Switching the display language to Korean>

Ctrl + Shift + P

language > Configure Display Language > 한국어 (Korean)

 

ํŒŒ์ด์ฌ ์„ค์น˜

 

Virtual environment setup

Terminal > New Terminal: opens a terminal window

Creating the virtual environment

Command: python -m venv myenv (myenv is the virtual environment name)

Running the command above creates the virtual environment folder.

Go into that folder and run activate.

The command for changing folders is cd.

Enter the commands below in order.

cd myenv

cd Scripts

activate (cd is not used when running a file)
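To confirm that activation worked, one quick check (not part of the original walkthrough) is to compare sys.prefix with sys.base_prefix from inside Python; the two differ when the interpreter is running from a venv. The helper name in_virtualenv is made up for this sketch.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a venv."""
    # Inside a venv, sys.prefix points at the environment folder (e.g. myenv),
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("interpreter prefix:", sys.prefix)
```

If the first line prints False, the terminal is probably still using the system-wide Python and activate needs to be run again.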

 

Install the required libraries

 

pip list: command for checking which libraries are installed

Install three libraries

pip install pandas

pip install selenium

pip install lxml
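Besides pip list, the installs can be checked from Python itself. The sketch below assumes only the standard library; missing_packages is a hypothetical helper name, and the package list is the three libraries installed above.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the names from `names` that cannot be imported here."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    required = ["pandas", "selenium", "lxml"]
    missing = missing_packages(required)
    if missing:
        print("still missing:", ", ".join(missing))
    else:
        print("all required packages are importable")
```

Running this with the virtual environment active shows whether the packages went into myenv rather than into some other Python installation.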

 

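Install chromedriver

Put a chromedriver.exe that matches the installed Chrome version in the workspace folder. (Selenium 4.6 or later can usually find or download a matching driver on its own through Selenium Manager, so this step may not be needed.)

Create a new file, market_cap.py.

 

Type python in the terminal: it switches to the Python interactive prompt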

from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
browser.maximize_window() # maximize the window

# 1. Open the page
url = 'https://finance.naver.com/sise/sise_market_sum.naver?&page='
browser.get(url)

#2. Reset the view options (uncheck every item that is currently checked)
checkboxes = browser.find_elements(By.NAME, 'fieldIds')
for checkbox in checkboxes:
    if checkbox.is_selected(): # currently checked
        checkbox.click() # click to uncheck

 

 

์˜์—…์ด์ต, ์ž์‚ฐ์ด๊ณ„, ๋งค์ถœ์•ก ์ฒดํฌ๋ฐ•์Šค ํ•˜์—ฌ ์กฐํšŒ

(์–ต)์€ lable ๋ฐ–์— ์žˆ๊ธฐ ๋•Œ๋ฌธ์— vlaue ๊ฐ’ ๊ฐ€์ ธ์˜ค๊ธฐ ์• ๋งคํ•จ > ์ƒ๋žต

#3. Set the view options (check the items you want, then apply)
items_to_selected = ['영업이익', '자산총계', '매출액']
for checkbox in checkboxes:
    parent = checkbox.find_element(By.XPATH, '..') # '..' means the parent element
    label = parent.find_element(By.TAG_NAME, 'label')
    # print(label.text) # check the label name
    # Check the checkbox of each wanted item
    if label.text in items_to_selected : # label matches one of the selected items
        checkbox.click() # check it

#4. Click the Apply button (it is an <a> tag whose href calls a JS function)
btn_apply = browser.find_element(By.XPATH, '//a[@href="javascript:fieldSubmit()"]')
btn_apply.click()

์›ํ•˜๋Š” ๋ฐ์ดํ„ฐ ์ถ”์ถœ

ํ•„์š” ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ import

import os
from io import StringIO

import pandas as pd
#5. Extract the data
# The listing had pages 1 to 41 > loop over them
for idx in range(1, 42) :
    # Preliminary step (move to the page)
    browser.get(url + str(idx)) # finance.naver.com/...~page=1,2,3... (url is a string, so idx is converted to a string)
    # Extract the data (StringIO avoids the literal-HTML deprecation warning in newer pandas)
    df = pd.read_html(StringIO(browser.page_source))[1]
    df.dropna(axis='index', how='all', inplace=True) # drop rows that are entirely empty
    df.dropna(axis='columns', how='all', inplace=True) # drop columns that are entirely empty
    if len(df) == 0: # no more data
        break

    #6. Save to a file (uses the os import)
    f_name = 'sise.csv'
    if os.path.exists(f_name) : # the file already exists
        df.to_csv(f_name, encoding='utf-8-sig', index=False, mode='a', header=False) # append to the file (mode='a') without writing the header again
    else : # the file does not exist yet (write the header too)
        df.to_csv(f_name, encoding='utf-8-sig', index=False)
    print(f'Page {idx} done')

# Close the browser
browser.quit()
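The cleaning and saving steps inside the loop can be tried without a browser. The sketch below is a minimal stand-in, assuming made-up table data in place of the Naver page; clean_table and append_to_csv are hypothetical helper names, and the header=is_new shortcut is an alternative to the if/else above, not the code used in this post.

```python
import os
import tempfile

import pandas as pd


def clean_table(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows and columns that are entirely empty (NaN/None)."""
    df = df.dropna(axis="index", how="all")
    return df.dropna(axis="columns", how="all")


def append_to_csv(df: pd.DataFrame, path: str) -> None:
    """Append df to path, writing the header only when the file is new."""
    is_new = not os.path.exists(path)
    df.to_csv(path, encoding="utf-8-sig", index=False, mode="a", header=is_new)


if __name__ == "__main__":
    # Made-up "page": the empty row and column mimic spacer cells in an HTML table.
    page = pd.DataFrame({
        "name": ["A", None, "B"],
        "price": [100, None, 200],
        "blank": [None, None, None],
    })
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.csv")
        append_to_csv(clean_table(page), path)  # first call writes the header
        append_to_csv(clean_table(page), path)  # second call appends rows only
        print(pd.read_csv(path, encoding="utf-8-sig"))
```

One thing to watch with this append pattern: if sise.csv is left over from an earlier run, a new run keeps adding rows to it, so deleting the old file first is probably what you want.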

์‚ฌ์ด๋“œ์— ํŒŒ์ผ ์ƒ๊น€!

์ƒ์„ฑ๋œ csv ํŒŒ์ผ