์ƒˆ์†Œ์‹

๐Ÿ ํŒŒ์ด์ฌ (Python)

ํŒŒ์ด์ฌ ํฌ๋กค๋ง (Python crawling) - urllib ๋กœ ์ด๋ฏธ์ง€ , html ๋ฌธ์„œ ๋‹ค์šด๋ฐ›๊ธฐ.

  • -

https://docs.python.org/ko/3/library/urllib.request.html#module-urllib.request

 

urllib.request โ€” Extensible library for opening URLs โ€” Python 3.8.2 ๋ฌธ์„œ

urllib.request โ€” Extensible library for opening URLs Source code: Lib/urllib/request.py The urllib.request module defines functions and classes which help in opening URLs (mostly HTTP) in a complex world โ€” basic and digest authentication, redirections, coo

docs.python.org

 

ํŒŒ์ด์ฌ์˜ ๊ธฐ๋ณธ๋ชจ๋“ˆ์ธ urllib์˜ requestํ•จ์ˆ˜๋ฅผ ํ†ตํ•ด HTTP์ •๋ณด๋ฅผ ์ˆ˜์‹ , ์ฝ๊ธฐ๊ฐ€ ๊ฐ€๋Šฅํ•˜๋‹ค.

ex) ์ด๋ฏธ์ง€ํŒŒ์ผ, html(ํŽ˜์ด์ง€ ์†Œ์Šค)

 

#urllib #HTTP ์ •๋ณด ์ˆ˜์‹  ํ•˜๊ธฐ import urllib.request as req #ํŒŒ์ผ URL image_url = 'http://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fk.kakaocdn.net%2Fdn%2F18CAR%2FbtqzK1rSB5g%2FXR4oK6MYIBHJEVPj7rgYBk%2Fimg.png' html_url = 'http://google.com' #๋‹ค์šด ๋ฐ›์„ ๊ฒฝ๋กœ save_path1 = '..test1.jpg' save_path2 = '..index.html' try: file1 , header1 = req.urlretrieve(img_url, save_path1) # ํ—ค๋”์ •๋ณด์™€ ํŒŒ์ผ๊ฒฝ๋กœ๋ฅผ ๋ฆฌํ„ดํ•œ๋‹ค. file2 , header2 = req.urlretrieve(html_url, save_path2) except Exception as e: print("Download failed") print(e) else: print(header1) print(header2) #๋‹ค์šด๋กœ๋“œ ํŒŒ์ผ ์ •๋ณด print('Filename1 {}'.format(file1)) print('Filename2 {}'.format(file2)) print() print("Download Succeed")

urlretrieve : ์ง€์ •ํ•ด๋‘” url์—์„œ ํŒŒ์ผ์„ ์ €์žฅํ•œ ๋’ค , ํ—ค๋”์ •๋ณด์™€ ํŒŒ์ผ๊ฒฝ๋กœ๋ฅผ ๋ฆฌํ„ดํ•œ๋‹ค.

 

ํ—ค๋” ์ •๋ณด์™€ ํŒŒ์ผ์ •๋ณด ์ถœ๋ ฅ๋‚ด์šฉ
Date: Tue, 25 Feb 2020 16:35:31 GMT
Server: PWS/8.3.2.7
X-Px: ms h0-s378.p63-icn ( h0-s411.p63-icn), rf-ht h0-s411.p63-icn ( h0-s776.p61-icn), rf-ht h0-s776.p61-icn ( origin)
Age: 0
Cache-Control: max-age=7200
Expires: Tue, 25 Feb 2020 18:35:31 GMT
Accept-Ranges: bytes
Content-Length: 49119
Content-Type: image/png
Last-Modified: Mon, 06 Jan 2020 13:17:56 GMT
Connection: close


Date: Tue, 25 Feb 2020 16:35:31 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
P3P: CP="This is not a P3P policy! See g.co/p3phelp for more info."
Server: gws
X-XSS-Protection: 0
X-Frame-Options: SAMEORIGIN
Set-Cookie: 1P_JAR=2020-02-25-16; expires=Thu, 26-Mar-2020 16:35:31 GMT; path=/; domain=.google.com; Secure
Set-Cookie: NID=198=uijrh2ejBDB4SLoz54AmFcJpl4FJcYZyo9enSWgkb7YBaR7dx1U1kKXxiTdTETgQ63hh--lXDK7ophlQKkUwAL7nUR6BgfGQ7V_RyBJxgam6M_2124ap4mWG1NoE_dDQq5AZoC3Jqxb33CVo-oY0DsXcaZSf1_klbQfRgB0sTB4; expires=Wed, 26-Aug-2020 16:35:31 GMT; path=/; domain=.google.com; HttpOnly
Accept-Ranges: none
Vary: Accept-Encoding
Connection: close


Filename1 ..test1.jpg
Filename2 ..index.html
#urlopen ํ•จ์ˆ˜ import urllib.request as req from urllib.error import URLError, HTTPError # ๋‹ค์šด๋กœ๋“œ ๊ฒฝ๋กœ ๋ฐ ํŒŒ์ผ๋ช… path_list = ["..test2.jpg", "..index2.html"] # ๋‹ค์šด๋กœ๋“œ ๋ฆฌ์†Œ์Šค url target_url = ["https://movie-phinf.pstatic.net/20190625_168/1561426986010A3uBi_JPEG/movie_image.jpg", "http://infinitt.tisotry.com"] for i, url in enumerate(target_url): #์˜ˆ์™ธ ์ฒ˜๋ฆฌ try : # ์›น ์ˆ˜์‹  ์ •๋ณด ์ฝ๊ธฐ response = req.urlopen(url) # ์ˆ˜์‹  ๋‚ด์šฉ contents = response.read() print("--------------1----------------") # ์ƒํƒœ ์ •๋ณด ์ถœ๋ ฅ (200๋ฒˆ์ด ์ •์ƒ.) print("Header Info-{} : {}". format(i, response.info())) print("HTTP Status Code: {}".format(response.getcode())) print() print("---------------2---------------") with open(path_list[i], 'wb') as c : #write binary c.write(contents) except HTTPError as e : print("Download Failed.") print("HTTPError Code:", e.code) except URLError as u : print("Download Failed.") print("URL Error Reason:", e.reason) else : print() print("Download Succeed.")

ํฌ์ŠคํŒ… ์ฃผ์†Œ๋ฅผ ๋ณต์‚ฌํ–ˆ์Šต๋‹ˆ๋‹ค

์ด ๊ธ€์ด ๋„์›€์ด ๋˜์—ˆ๋‹ค๋ฉด ๊ณต๊ฐ ๋ถ€ํƒ๋“œ๋ฆฝ๋‹ˆ๋‹ค.