Web Scraping - Python

Porits789 2024. 3. 15. 09:42


This post collects practice code for Python web-scraping libraries, with short hands-on examples for bs4 and selenium.

BeautifulSoup4 practice

Install

pip install beautifulsoup4

Creating the object

import requests
from bs4 import BeautifulSoup

res = requests.get("http://example.com")
soup = BeautifulSoup(res.text, "html.parser")

# Use prettify() for easier-to-read output.
# print(soup.prettify())

Getting tags - find

Use find to get a single element and find_all to get several.

soup.find("h1")
results = soup.find_all("p")
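As a small illustration of the difference, the sketch below parses an inline HTML string instead of a live page. The markup and variable names are made up for this example. find returns the first match (or None when nothing matches), while find_all returns a list of every match.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this example.
html = """
<html><body>
  <h1>Example Domain</h1>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

first_heading = soup.find("h1")   # first matching tag, or None if absent
paragraphs = soup.find_all("p")   # list-like result with every match
missing = soup.find("table")      # the markup has no <table>, so this is None

heading_text = first_heading.text
paragraph_texts = [p.text for p in paragraphs]

print(heading_text)
print(paragraph_texts)
print(missing)
```

Because find can return None, it is safer to check the result before reading attributes such as .text from it.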

Using locators - id, class

# Get an element by id
soup.find("div", id="results")
# Get an element by class (class is a reserved word in Python, so bs4 uses class_)
result = soup.find("div", class_="page-header")

# Print the text value
# result.h1.text.strip()
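A runnable sketch of the same lookups, using invented markup that reuses the "results" id and "page-header" class from above. The "item" class and the variable names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this example.
html = """
<div id="results">
  <div class="page-header"><h1>  Search results  </h1></div>
  <div class="item">A</div>
  <div class="item">B</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Look up by id, then by class (note the trailing underscore in class_).
container = soup.find("div", id="results")
header = soup.find("div", class_="page-header")

# strip() removes the surrounding whitespace from the heading text.
title = header.h1.text.strip()

# Searching from `container` limits the search to its descendants.
items = [div.text for div in container.find_all("div", class_="item")]

print(title)
print(items)
```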

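Pagination

When a site has many pages, the page can usually be changed through the URL. So call requests.get repeatedly and collect the values from each page.

# range(1, 10) visits pages 1 through 9
for i in range(1, 10):
    res = requests.get(f"http://example.com/page={i}")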
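The loop can be wrapped in a small helper that keeps the HTML of each page. This is only a sketch: scrape_pages, fake_fetch, and the URL template are hypothetical names, and the fetcher is passed in as a function so the example runs offline with a stand-in.

```python
from typing import Callable, Dict


def scrape_pages(fetch: Callable[[str], str], url_template: str,
                 first: int, last: int) -> Dict[int, str]:
    """Fetch pages first..last (inclusive) and return {page number: HTML}."""
    pages = {}
    for page in range(first, last + 1):
        url = url_template.format(page=page)
        pages[page] = fetch(url)
    return pages


def fake_fetch(url: str) -> str:
    """Stand-in fetcher so the sketch needs no network access."""
    return f"<html><body><p>content of {url}</p></body></html>"


pages = scrape_pages(fake_fetch, "http://example.com/page={page}", 1, 3)
print(sorted(pages))
```

Against a real site, the fetcher could be something like lambda url: requests.get(url, timeout=10).text, and each stored HTML string can then be handed to BeautifulSoup.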

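Dynamic web pages

Static web page : a site whose HTML content is fixed.
Dynamic web page : a site whose HTML content changes, mostly through JavaScript.
=> Dynamic pages load data asynchronously, so if the HTML is read before that processing finishes, the data may be incomplete.
In Python, Selenium is used to scrape these pages.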
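The sketch below shows why a plain HTML parser is not enough for such pages. The markup is a made-up response body in which a script fills the list after loading. BeautifulSoup only parses the text it is given and never runs the script, so the list stays empty.

```python
from bs4 import BeautifulSoup

# Hypothetical response body: the list is filled in by JavaScript after load.
raw_html = """
<html><body>
  <ul id="items"></ul>
  <script>
    document.getElementById("items").innerHTML = "<li>loaded later</li>";
  </script>
</body></html>
"""

soup = BeautifulSoup(raw_html, "html.parser")

# The script is never executed, so the <ul> has no <li> children here.
items = soup.find("ul", id="items").find_all("li")
print(len(items))
```

A real browser would run the script and show one list item, which is the gap a browser-driving tool fills.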

Selenium library

: Selenium is an automation framework that lets you control a web browser from Python.

Install

: Install the selenium library and webdriver-manager before use.

# Install in a Jupyter environment
%pip install selenium
%pip install webdriver-manager
# Install with Anaconda
conda install -c conda-forge selenium

Opening a Chrome window

How to open a Chrome window with the WebDriver module.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
# ChromeDriverManager().install() downloads the driver and returns its path, so the driver is loaded along with the browser.
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("http://www.example.com")

# with-as version (the browser is closed automatically when the block ends)
with webdriver.Chrome(service=Service(ChromeDriverManager().install())) as driver:
    driver.get("http://www.example.com")

Finding elements: find_element

: Use By together with find_element to locate elements on the page.

.find_element(by, target) : a single element
.find_elements(by, target) : multiple elements

from selenium.webdriver.common.by import By

# Example: find a p tag
with webdriver.Chrome(service=Service(ChromeDriverManager().install())) as driver:
    driver.get("http://www.example.com")
    print(driver.find_element(By.TAG_NAME, "p").text)

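Wait

: Scraping a dynamic page sometimes requires waiting for the page to finish loading.

Implicit Wait : sets the maximum time the driver keeps retrying when an element is not found right away. driver.implicitly_wait(5)
Explicit Wait : uses the until method to wait until the target element exists, then runs the next command.

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
# Returns the element once it is present; raises TimeoutException if it is still missing after 10 seconds.
element = WebDriverWait(driver,10).until(EC.presence_of_element_located((By.XPATH,'')))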

Event handling (mouse, keyboard)

: ActionChains can perform actions such as mouse and keyboard input.

from selenium.webdriver import Keys, ActionChains
from selenium.webdriver.common.by import By

# Click a button
button = driver.find_element(By.XPATH,'')
ActionChains(driver).click(button).perform()

# Send a value to an input element.
text_input = driver.find_element(By.XPATH,'')
ActionChains(driver).send_keys_to_element(text_input, "input_text").perform()
