ํŒŒ์ด์ฌ ํฌ๋กค๋ง (Python crawling) - urllib ๋กœ ์ด๋ฏธ์ง€ , html ๋ฌธ์„œ ๋‹ค์šด๋ฐ›๊ธฐ.

https://docs.python.org/ko/3/library/urllib.request.html#module-urllib.request

 

urllib.request: Extensible library for opening URLs (Python 3.8.2 documentation, docs.python.org)

 

ํŒŒ์ด์ฌ์˜ ๊ธฐ๋ณธ๋ชจ๋“ˆ์ธ urllib์˜ requestํ•จ์ˆ˜๋ฅผ ํ†ตํ•ด HTTP์ •๋ณด๋ฅผ ์ˆ˜์‹ , ์ฝ๊ธฐ๊ฐ€ ๊ฐ€๋Šฅํ•˜๋‹ค.

ex) ์ด๋ฏธ์ง€ํŒŒ์ผ, html(ํŽ˜์ด์ง€ ์†Œ์Šค)

 

*request.urlretrieve(url, file path)

# urllib
# Receive data over HTTP
import urllib.request as req

# URLs of the files to download
image_url = 'http://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fk.kakaocdn.net%2Fdn%2F18CAR%2FbtqzK1rSB5g%2FXR4oK6MYIBHJEVPj7rgYBk%2Fimg.png'
html_url = 'http://google.com'

# Paths to save the downloads to
save_path1 = '..test1.jpg'
save_path2 = '..index.html'


try:
    file1, header1 = req.urlretrieve(image_url, save_path1)  # returns the saved file path and the header info
    file2 , header2 = req.urlretrieve(html_url, save_path2)
except Exception as e:
    print("Download failed")
    print(e)
else:
    print(header1)
    print(header2)

    # Downloaded file info
    print('Filename1 {}'.format(file1))
    print('Filename2 {}'.format(file2))
    print()
    print("Download Succeed")

urlretrieve: saves the resource at the given URL to a file, then returns the file path and the header information as a tuple.

 

ํ—ค๋” ์ •๋ณด์™€ ํŒŒ์ผ์ •๋ณด ์ถœ๋ ฅ๋‚ด์šฉ
Date: Tue, 25 Feb 2020 16:35:31 GMT
Server: PWS/8.3.2.7
X-Px: ms h0-s378.p63-icn ( h0-s411.p63-icn), rf-ht h0-s411.p63-icn ( h0-s776.p61-icn), rf-ht h0-s776.p61-icn ( origin)
Age: 0
Cache-Control: max-age=7200
Expires: Tue, 25 Feb 2020 18:35:31 GMT
Accept-Ranges: bytes
Content-Length: 49119
Content-Type: image/png
Last-Modified: Mon, 06 Jan 2020 13:17:56 GMT
Connection: close


Date: Tue, 25 Feb 2020 16:35:31 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
P3P: CP="This is not a P3P policy! See g.co/p3phelp for more info."
Server: gws
X-XSS-Protection: 0
X-Frame-Options: SAMEORIGIN
Set-Cookie: 1P_JAR=2020-02-25-16; expires=Thu, 26-Mar-2020 16:35:31 GMT; path=/; domain=.google.com; Secure
Set-Cookie: NID=198=uijrh2ejBDB4SLoz54AmFcJpl4FJcYZyo9enSWgkb7YBaR7dx1U1kKXxiTdTETgQ63hh--lXDK7ophlQKkUwAL7nUR6BgfGQ7V_RyBJxgam6M_2124ap4mWG1NoE_dDQq5AZoC3Jqxb33CVo-oY0DsXcaZSf1_klbQfRgB0sTB4; expires=Wed, 26-Aug-2020 16:35:31 GMT; path=/; domain=.google.com; HttpOnly
Accept-Ranges: none
Vary: Accept-Encoding
Connection: close


Filename1 ..test1.jpg
Filename2 ..index.html
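
The first header block above reports Content-Type: image/png, although the file was saved as test1.jpg. As a minimal sketch, the headers returned by urlretrieve can be used to check what was actually downloaded. The helper name retrieve_with_type and the local file:// source are assumptions made only so the example runs without network access; they are not part of the original post.

```python
import mimetypes
import tempfile
import urllib.request as req
from pathlib import Path


def retrieve_with_type(url, save_path):
    """Download url to save_path and report what the headers say about it."""
    # urlretrieve returns (saved file path, header object)
    path, headers = req.urlretrieve(url, save_path)
    # get_content_type() drops parameters such as "; charset=..."
    content_type = headers.get_content_type()
    # Extension that matches the reported content type (may be None)
    return path, content_type, mimetypes.guess_extension(content_type)


# Hypothetical stand-in for a remote image: a local file behind a file:// URL
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.png"
    source.write_bytes(b"\x89PNG\r\n\x1a\n")  # PNG signature only, not a full image
    target = Path(tmp) / "test1.jpg"
    path, content_type, extension = retrieve_with_type(source.as_uri(), str(target))
    copied = target.read_bytes()

print(path, content_type, extension)
```

With an http:// URL the same call returns the server's headers, so comparing the suggested extension with the one in the save path is one way to notice a mismatch like the PNG saved as .jpg above.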

 

*Downloading with urlopen, and handling HTTP errors and URL errors

# urlopen function

import urllib.request as req

from urllib.error import URLError, HTTPError 

# Download paths and file names
path_list = ["..test2.jpg", "..index2.html"]

# URLs of the resources to download
target_url = ["https://movie-phinf.pstatic.net/20190625_168/1561426986010A3uBi_JPEG/movie_image.jpg", "http://infinitt.tisotry.com"]

for i, url in enumerate(target_url):
    # Exception handling
    try :
        # Open the URL and receive the response
        response = req.urlopen(url)
        # Read the response body
        contents = response.read()

        print("--------------1----------------")
        # Print the header and status information (200 means OK)
        print("Header Info-{} : {}".format(i, response.info()))
        print("HTTP Status Code: {}".format(response.getcode()))
        print()
        print("---------------2---------------")

        with open(path_list[i], 'wb') as c : #write binary
            c.write(contents)

    except HTTPError as e :
        print("Download Failed.")
        print("HTTPError Code:", e.code)
    except URLError as u :
        print("Download Failed.")
        print("URL Error Reason:", u.reason)

    else : 
        print()
        print("Download Succeed.")
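
The order of the except clauses above matters: HTTPError is a subclass of URLError, so it has to be caught first or the URLError branch would handle both. A minimal sketch of the same pattern as a reusable function follows. The name fetch, the timeout value, and the deliberately invalid URL scheme are assumptions for illustration; the invalid scheme only serves to trigger a URLError without touching the network.

```python
import urllib.request as req
from urllib.error import HTTPError, URLError


def fetch(url, timeout=10):
    """Return (ok, detail): the body bytes on success, an error message otherwise."""
    try:
        # The with-block closes the response even if read() fails
        with req.urlopen(url, timeout=timeout) as response:
            return True, response.read()
    except HTTPError as e:
        # The server answered, but with an error status such as 404
        return False, "HTTP error code: {}".format(e.code)
    except URLError as e:
        # No usable response at all (bad host, unknown scheme, refused connection)
        return False, "URL error reason: {}".format(e.reason)


# HTTPError derives from URLError, hence the order of the except clauses
is_subclass = issubclass(HTTPError, URLError)
print(is_subclass)

# Hypothetical URL with an unsupported scheme, used to force a URLError offline
ok, detail = fetch("nosuchscheme://example")
print(ok, detail)
```

Each exception is bound to its own name inside its own except block here, which avoids the kind of mix-up where one branch refers to a variable bound in another.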