
Python crawling: downloading images and HTML documents with urllib

https://docs.python.org/ko/3/library/urllib.request.html#module-urllib.request


The request module of urllib, part of Python's standard library, can fetch and read resources over HTTP.

e.g., image files, HTML (page source)

*request.urlretrieve(url, file path)

# urllib
# Fetching resources over HTTP
import urllib.request as req

# URLs of the files to download
image_url = 'http://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdna%2F18CAR%2FbtqzK1rSB5g%2FAAAAAAAAAAAAAAAAAAAAAOzJvR7MSy0P3ctFjFxBJosh5wD8gAelVdmlvtOQFhpv%2Fimg.png%3Fcredential%3DyqXZFxpELC7KVnFOS48ylbz2pIh7yKj8%26expires%3D1764514799%26allow_ip%3D%26allow_referer%3D%26signature%3DfHLxWz1Tmq76TJsZV6dD0x8CC4o%253D'
html_url = 'http://google.com'

# Paths to save the downloads to
save_path1 = '..test1.jpg'
save_path2 = '..index.html'


try:
    file1, header1 = req.urlretrieve(image_url, save_path1)  # returns the file path and the header information
    file2, header2 = req.urlretrieve(html_url, save_path2)
except Exception as e:
    print("Download failed")
    print(e)
else:
    print(header1)
    print(header2)

    # Information about the downloaded files
    print('Filename1 {}'.format(file1))
    print('Filename2 {}'.format(file2))
    print()
    print("Download Succeed")

urlretrieve: saves the resource at the given URL to a file, then returns the file path and the header information, in that order.
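The return value can be checked without network access. The following is a minimal sketch, assuming a temporary local file addressed through a file:// URL stands in for a remote resource; the names source, target, saved, and content_type are hypothetical and not part of the original example.

```python
import tempfile
import urllib.request as req
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical local file standing in for a real web resource.
    source = Path(tmp) / "source.html"
    source.write_bytes(b"<html><body>hello</body></html>")

    target = Path(tmp) / "copy.html"

    # urlretrieve returns a tuple: (local file name, headers object).
    filename, headers = req.urlretrieve(source.as_uri(), str(target))

    # Read everything back before the temporary directory is removed.
    saved = target.read_bytes()
    content_type = headers.get_content_type()

print("Saved to:", filename)
print("Content type:", content_type)
print("Bytes written:", len(saved))
```

With a real http:// or https:// URL the call has the same shape; only the headers differ, since they then come from the server rather than from the local file system.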

Printed header and file information

Date: Tue, 25 Feb 2020 16:35:31 GMT
Server: PWS/8.3.2.7
X-Px: ms h0-s378.p63-icn ( h0-s411.p63-icn), rf-ht h0-s411.p63-icn ( h0-s776.p61-icn), rf-ht h0-s776.p61-icn ( origin)
Age: 0
Cache-Control: max-age=7200
Expires: Tue, 25 Feb 2020 18:35:31 GMT
Accept-Ranges: bytes
Content-Length: 49119
Content-Type: image/png
Last-Modified: Mon, 06 Jan 2020 13:17:56 GMT
Connection: close

Date: Tue, 25 Feb 2020 16:35:31 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
P3P: CP="This is not a P3P policy! See g.co/p3phelp for more info."
Server: gws
X-XSS-Protection: 0
X-Frame-Options: SAMEORIGIN
Set-Cookie: 1P_JAR=2020-02-25-16; expires=Thu, 26-Mar-2020 16:35:31 GMT; path=/; domain=.google.com; Secure
Set-Cookie: NID=198=uijrh2ejBDB4SLoz54AmFcJpl4FJcYZyo9enSWgkb7YBaR7dx1U1kKXxiTdTETgQ63hh--lXDK7ophlQKkUwAL7nUR6BgfGQ7V_RyBJxgam6M_2124ap4mWG1NoE_dDQq5AZoC3Jqxb33CVo-oY0DsXcaZSf1_klbQfRgB0sTB4; expires=Wed, 26-Aug-2020 16:35:31 GMT; path=/; domain=.google.com; HttpOnly
Accept-Ranges: none
Vary: Accept-Encoding
Connection: close

Filename1 ..test1.jpg
Filename2 ..index.html
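Individual fields can be pulled out of a header block like the ones printed above. The headers object returned by urlretrieve is an http.client.HTTPMessage, a subclass of email.message.Message, so the usual Message accessors should apply. The sketch below assumes two header lines copied from the google.com output and parses them with the standard email module instead of making a request; raw_headers and the other variable names are hypothetical.

```python
from email import message_from_string

# Two header lines copied from the google.com output shown above.
raw_headers = (
    "Content-Type: text/html; charset=ISO-8859-1\n"
    "Server: gws\n"
    "\n"
)
headers = message_from_string(raw_headers)

# Header names are matched case-insensitively.
content_type = headers.get_content_type()
charset = headers.get_content_charset()
server = headers["server"]

print(content_type)
print(charset)
print(server)
```

The charset value is useful when decoding the bytes returned by response.read() into text.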

*Downloading with urlopen and handling HTTPError and URLError exceptions

# urlopen function

import urllib.request as req

from urllib.error import URLError, HTTPError 

# Download paths and file names
path_list = ["..test2.jpg", "..index2.html"]

# URLs of the resources to download
target_url = ["https://movie-phinf.pstatic.net/20190625_168/1561426986010A3uBi_JPEG/movie_image.jpg", "http://infinitt.tisotry.com"]

for i, url in enumerate(target_url):
    # Exception handling
    try:
        # Open the URL and receive the response
        response = req.urlopen(url)
        # Response body as bytes
        contents = response.read()

        print("--------------1----------------")
        # Print the header and status information (200 means OK)
        print("Header Info-{} : {}".format(i, response.info()))
        print("HTTP Status Code: {}".format(response.getcode()))
        print()
        print("---------------2---------------")

        with open(path_list[i], 'wb') as c : #write binary
            c.write(contents)

    except HTTPError as e :
        print("Download Failed.")
        print("HTTPError Code:", e.code)
    except URLError as e:
        print("Download Failed.")
        print("URL Error Reason:", e.reason)

    else : 
        print()
        print("Download Succeed.")
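The order of the except clauses matters because HTTPError is a subclass of URLError: if URLError were listed first, it would also catch HTTP errors. The following is a minimal sketch of the same pattern wrapped in a helper; the function name download and its status strings are hypothetical, not part of urllib.

```python
import urllib.request as req
from urllib.error import HTTPError, URLError


def download(url, path):
    """Save the body of url to path and return a short status string."""
    try:
        # The with block closes the response once the body has been read.
        with req.urlopen(url) as response:
            contents = response.read()
    except HTTPError as e:
        # Must come first: HTTPError is a subclass of URLError.
        return "HTTPError {}".format(e.code)
    except URLError as e:
        # Reached for unresolvable hosts, refused connections, missing files, etc.
        return "URLError {}".format(e.reason)

    with open(path, "wb") as f:  # write binary
        f.write(contents)
    return "OK"
```

A call such as download(url, path_list[i]) inside the loop would then return "OK" on success or a string describing the failure.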
