[Web Crawling] Extracting Data from a Local HTML File


ISLA! 2023. 8. 4. 11:18

Getting Started with Web Crawling

Crawling usually targets live web pages, but let's start with something simpler: crawling data from a local file.

❗Whatever the project, directory (folder) and file paths matter.
>> For convenience, this post assumes the top-level folder is named crawling.


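1. Install the required libraries

Open VS Code.
Create requirements.txt in the project's top-level folder and list the required libraries in it (contents shown below).
Install the libraries by running this command in the terminal:
- pip install -r requirements.txt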

selenium
webdriver-manager
numpy
pandas
scikit-learn
matplotlib
seaborn
plotly
jupyterlab
requests
beautifulsoup4
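
After running pip install, it can be handy to confirm that every package in requirements.txt actually landed in the active environment. The snippet below is a minimal sketch using only the standard library. The helper names missing_packages and read_requirements are made up for this post, and the sketch assumes each line of requirements.txt is a bare package name, as in the list above.

```python
from importlib import metadata


def missing_packages(names):
    """Return the names from `names` that are not installed in this environment."""
    missing = []
    for name in names:
        try:
            metadata.version(name)  # raises if the distribution is not installed
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


def read_requirements(path="requirements.txt"):
    """Read bare package names, skipping blank lines and comments."""
    with open(path, encoding="utf-8") as f:
        lines = (line.strip() for line in f)
        return [line for line in lines if line and not line.startswith("#")]
```

Calling missing_packages(read_requirements()) from the crawling folder should return an empty list once the install has succeeded.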

2. Create the HTML file

Keeping the folder layout in mind, create index.html as follows:
Create a folder named html
Create index.html inside the html folder
crawling > html > index.html

<!DOCTYPE html>
<html>
    <head>
        <meta charset="UTF-8">
        <title>Crawling Web Page</title>
    </head>
    <body>
        <div class="mulcam1">
            <p>Please do not crawl here!</p>
        </div>
        <div class="fakecampus">
            <p>Please crawl here!</p>
        </div>  
    </body>
</html>
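
If you would rather create the folder and file from Python than by hand, something like the following should work. It is only a sketch: create_index_html and INDEX_HTML are names invented for this example, and the markup follows the structure of the listing above.

```python
from pathlib import Path

# Same structure as the index.html listing: two divs, one <p> in each.
INDEX_HTML = """<!DOCTYPE html>
<html>
    <head>
        <meta charset="UTF-8">
        <title>Crawling Web Page</title>
    </head>
    <body>
        <div class="mulcam1">
            <p>Please do not crawl here!</p>
        </div>
        <div class="fakecampus">
            <p>Please crawl here!</p>
        </div>
    </body>
</html>
"""


def create_index_html(project_root="."):
    """Create <project_root>/html/index.html and return its path."""
    html_dir = Path(project_root) / "html"
    html_dir.mkdir(parents=True, exist_ok=True)  # no error if the folder exists
    index_path = html_dir / "index.html"
    index_path.write_text(INDEX_HTML, encoding="utf-8")
    return index_path
```

Run create_index_html() from inside the crawling folder so the file ends up at crawling > html > index.html.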

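3. Create ch01.py

Create the file: ch01.py
- File path: crawling > ch01.py
- Command to create the file: touch ch01.py (macOS/Linux shells or Git Bash)
File contents:
- BeautifulSoup is the go-to library for this kind of crawling.
- Define a main function that opens the HTML file and passes it to BeautifulSoup along with the parser to use.

[Writing ch01.py]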

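# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup

def main():
    # Open index.html and initialize a BeautifulSoup object.
    # Web responses come in several formats (HTML, XML, JSON, ...),
    # so we state explicitly that this one is parsed as HTML.
    # The with-block closes the file once parsing is done.
    with open("html/index.html", encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")
    print(soup)

if __name__ == "__main__":
    main()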
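
One thing to watch: "html/index.html" is a relative path, so ch01.py only finds the file when it is run from the crawling folder. A possible workaround, sketched below, is to build the path from a known project root. The function names index_html_path and read_index_html are hypothetical and not part of the original script.

```python
from pathlib import Path


def index_html_path(project_root):
    """Build the path <project_root>/html/index.html."""
    return Path(project_root) / "html" / "index.html"


def read_index_html(project_root):
    """Return the contents of index.html, with a clearer error if it is missing."""
    path = index_html_path(project_root)
    if not path.is_file():
        raise FileNotFoundError(f"Expected the HTML file at {path}")
    return path.read_text(encoding="utf-8")
```

Inside ch01.py, passing Path(__file__).resolve().parent as project_root should make the script work no matter which directory the terminal is in. The returned string can be handed to BeautifulSoup in place of the open file.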

🔍 What is html.parser?

https://www.crummy.com/software/BeautifulSoup/bs4/doc/


Start with the official documentation: search for Beautiful Soup and go to the official docs page.
Study by copying the sample code there and running it yourself.
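
In short, "html.parser" tells Beautiful Soup to use the HTML parser that ships with Python's standard library (the html.parser module), so no extra install is needed. The documentation also describes third-party alternatives such as lxml and html5lib. To get a feel for what a parser does, here is a rough sketch that uses the standard-library HTMLParser class directly, without Beautiful Soup. The class name ParagraphCollector is made up for illustration.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text found inside every <p> element."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self.paragraphs.append("")  # start a new paragraph entry

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph:
            self.paragraphs[-1] += data  # append text to the current paragraph


collector = ParagraphCollector()
collector.feed('<div class="fakecampus"><p>Please crawl here!</p></div>')
print(collector.paragraphs)
```

The parser only reports events (start tag, end tag, text). Beautiful Soup builds on those events to produce a searchable tree, which is why methods like find and find_all are available on the soup object.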

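4. Crawling the local file with BeautifulSoup methods

Recall the index.html file from step 2.
Suppose we only want the part highlighted in yellow in the original screenshot, which the code below takes to be the <p> text inside the fakecampus div. How should the code be written?

👉 As a rule of thumb, start from the div tag.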

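Select the div with the class you want.
Collect every element wrapped in a <p> tag inside that div.
From that list, pick the paragraph you need by index and take its text. The index.html from step 2 has a single <p> in this div, so the index is 0.

# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup

def main():
    # Open index.html and initialize a BeautifulSoup object
    with open("html/index.html", encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")

    # Find the div with class "fakecampus", then collect every <p> inside it
    fake_str = soup.find('div', class_='fakecampus').find_all('p')
    # find_all returns a list; this div holds one <p>, so index 0 is the only valid index
    print(fake_str[0].get_text())


if __name__ == "__main__":
    main()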

👉 Check the result in the terminal and confirm that only the targeted text is printed.
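
The find / find_all chain raises an AttributeError when the div is missing (find returns None) and an IndexError when the index is out of range. A slightly more defensive variant, sketched here with a CSS selector via select_one, returns None instead. The helper name first_paragraph_text and the inline SAMPLE_HTML string are assumptions for illustration; the markup mirrors the two divs in index.html.

```python
from bs4 import BeautifulSoup

# Inline stand-in for index.html so the example runs on its own.
SAMPLE_HTML = """
<div class="mulcam1"><p>Please do not crawl here!</p></div>
<div class="fakecampus"><p>Please crawl here!</p></div>
"""


def first_paragraph_text(html, class_name):
    """Return the text of the first <p> inside <div class=class_name>, or None."""
    soup = BeautifulSoup(html, "html.parser")
    paragraph = soup.select_one(f"div.{class_name} p")  # CSS selector lookup
    if paragraph is None:  # no matching div, or no <p> inside it
        return None
    return paragraph.get_text(strip=True)


print(first_paragraph_text(SAMPLE_HTML, "fakecampus"))
print(first_paragraph_text(SAMPLE_HTML, "no-such-class"))
```

The first call prints the fakecampus paragraph and the second prints None, so a missing class no longer crashes the script.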


License: Attribution, NonCommercial, NoDerivatives