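Document Data

HTTP

HTTP stands for HyperText Transfer Protocol. It is a protocol that follows the client/server model for exchanging data over the Internet. It is an application-level protocol that runs on top of TCP/IP. HTTP can carry HTML, images, video, audio, text documents, and other kinds of data.

Viewing a page works as follows: a web browser (the client) connects to a web site (the server) and asks for the information it wants.

When the client sends a request message, the server processes the request and sends back a response message. The formats of the request and response messages are described below.

Request

The following is an example of a message that a client sends to a server.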

GET / HTTP/1.1\r\n
Host: www.google.com\r\n
User-Agent: python-requests/2.18.4\r\n
Accept-Encoding: gzip, deflate\r\n
Accept: */*\r\n
Connection: keep-alive\r\n
\r\n

A request message consists of a request line, headers, and a message body. The request line contains the request method (GET, POST, PUT, DELETE, HEAD, OPTIONS, or TRACE), the request URI, and the HTTP version.

GET / HTTP/1.1\r\n
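To make the structure concrete, here is a minimal sketch that splits the sample request into its request line, headers, and body using only the Python standard library. The function name parse_request is hypothetical, and the parser handles only this simple case, not the full HTTP/1.1 grammar.

```python
def parse_request(raw: str):
    """Split a raw HTTP/1.1 request into (method, uri, version, headers, body)."""
    # The header section ends at the first blank line (CRLF CRLF).
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    # Request line: method, request URI, and HTTP version separated by spaces.
    method, uri, version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return method, uri, version, headers, body


raw_request = (
    "GET / HTTP/1.1\r\n"
    "Host: www.google.com\r\n"
    "User-Agent: python-requests/2.18.4\r\n"
    "Accept-Encoding: gzip, deflate\r\n"
    "Accept: */*\r\n"
    "Connection: keep-alive\r\n"
    "\r\n"
)

method, uri, version, headers, body = parse_request(raw_request)
print(method, uri, version)
print(headers["Host"])
```

A GET request normally has no body, which is why the part after the blank line is empty here.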

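Request methods

The method tells the server what kind of request this is.

Method	Description
GET	Requests information from the server.
POST	Submits form data to the server.
PUT	Updates data on the server.
DELETE	Deletes data on the server.
HEAD	Requests only the HTTP headers. Used to check whether a resource exists or whether the server is working.
OPTIONS	Asks which methods the web server supports.
TRACE	Echoes the client's request back unchanged.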

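Request URI

The Uniform Resource Identifier (URI) identifies where the information is located on the server. In the example above, / is the URI.

Request headers

The request headers follow the request line. They carry metadata such as the browser type, the web page language, and the server address. See the HTTP/1.1 specification for details.

Response

When a request reaches the server, the server replies with a response message. The following is an example of a response message.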

HTTP/1.1 200 OK
'Date': 'Mon, 20 Nov 2017 01:11:02 GMT',
'Expires': '-1',
'Cache-Control':'private, max-age=0',
'Content-Type': 'text/html; charset=EUC-KR',
'P3P': 'CP="This is not a P3P policy! See g.co/p3phelp for more info."',
'Content-Encoding': 'gzip', 'Server': 'gws',
'Content-Length': '5065',
'X-XSS-Protection': '1; mode=block',
'X-Frame-Options': 'SAMEORIGIN',
'Set-Cookie': '1P_JAR=2017-11-20-01; expires=Wed, 20-Dec-2017 01:11:02 GMT; path=/; domain=.google.co.kr, NID=117=qpfOPSGFis-r3WpI7ejpkONxTkZj0W0LjhFgCJSmY3S4rGi2RFBjxHEvB_JvaWFw9OqLdP7TPDp9RxTU0VeVaJ0F5pR7jgcdmIFmL9G_XpGCnuotMM7p3V7yhxU-p3Kf; expires=Tue, 22-May-2018 01:11:02 GMT; path=/; domain=.google.co.kr; HttpOnly'

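A web browser's developer tools show the HTTP request and response information. In Chrome, press F12 to open the developer tools and click the Network tab to inspect the request and response messages.

Try it yourself

Use Chrome to inspect the request and response messages for the Korea University Sejong Campus home page (http://sejong.korea.ac.kr/kr).

Data Collection and Parsing

Reading a web page

The requests module is used to read web pages. It provides convenient functions for sending HTTP requests and handling the responses.

Printing the Daum home page

The following example connects to the Daum home page, fetches the HTML document, and prints it to the screen.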

import requests

resp = requests.get('http://daum.net')  # connect to the web site

if resp.status_code == requests.codes.ok:  # the response is OK
    html = resp.text  # read the web page
    print(html.split('\n')[0:10])  # print the first 10 lines of the page

['<!DOCTYPE html>', '<html lang="ko" class="">', '<head>', '<meta charset="utf-8"/>', '<title>Daum</title>', '<meta property="og:url" content="https://www.daum.net/">', '<meta property="og:type" content="website">', '<meta property="og:title" content="Daum">', '<meta property="og:image" content="//i1.daumcdn.net/svc/image/U03/common_icon/5587C4E4012FCD0001">', '<meta property="og:description" content="나의 관심 콘텐츠를 가장 즐겁게 볼 수 있는 Daum">']

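requests.get(url) requests the page at the given address using the GET method. resp.status_code is the status code of the response and is 200 when the request succeeded. requests.codes.ok is a constant for the success code 200. resp.text returns the HTML of the page as a string. When a request fails, the server returns an error status: 4xx codes indicate a client error such as a bad request, and 5xx codes indicate a server error. Calling Response.raise_for_status() raises an exception for such responses.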
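The status-code ranges can be illustrated with the standard library alone. This is a sketch, assuming a hypothetical helper describe_status; it mirrors the rule that codes from 400 through 599 count as errors, which is the range for which raise_for_status() raises an exception.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short label for an HTTP status code (hypothetical helper)."""
    phrase = HTTPStatus(code).phrase  # standard reason phrase for the code
    if 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "no error"
    return f"{code} {phrase}: {kind}"


for code in (200, 404, 500):
    print(describe_status(code))
```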

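Printing Google search results

You can connect to Google, search for a word, and print the result. When you type a search term into Google, the address bar shows a string of the form search?q=<search term>. Using this, you can append the search term to the site address directly, as follows, and obtain the search results.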

import requests

resp = requests.get('https://google.co.kr/search?q=인공지능')
if resp.status_code == requests.codes.ok:
    html = resp.text
    print(html[:100], '...middle omitted...', html[-100:], sep='\n')

<!doctype html><html itemscope="" itemtype="http://schema.org/SearchResultsPage" lang="ko"><head><me
...middle omitted...
"/client_204?&atyp=i&biw="+a+"&bih="+b+"&ei="+google.kEI);}).call(this);})();</script></body></html>

This is the result of searching for the word 인공지능 (artificial intelligence). The output is long, so the middle is omitted.
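The query string can also be built programmatically instead of typed by hand. The following is a minimal sketch using urllib.parse from the standard library; the helper name build_search_url is an assumption made for illustration. Percent-encoding the search term keeps non-ASCII characters safe inside the URL.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(base: str, term: str) -> str:
    """Append a percent-encoded q=<term> query string to base (hypothetical helper)."""
    return base + "?" + urlencode({"q": term})


url = build_search_url("https://google.co.kr/search", "인공지능")
print(url)

# Decoding the query string recovers the original search term.
print(parse_qs(urlsplit(url).query)["q"][0])
```

With requests, passing params={'q': term} to requests.get() performs the same encoding.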

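Try it yourself

Search for "날씨" (weather) on the Daum site and print the result.

Parsing (Beautiful Soup)

The BeautifulSoup module can pick out the information you need from a web page, for example the main news headlines on a portal site or the results of a search on a search site.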

Installation

pip install beautifulsoup4 # or
conda install -c anaconda beautifulsoup4 # when using Anaconda

Parsing a web page with BeautifulSoup

A BeautifulSoup object (bs) is created from a web document. Through this object you can reach the parts of the document you need and collect what you want. There are several ways to reach an element; this document uses the select() method. Its argument is a string of CSS (Cascading Style Sheets) selectors. For a detailed description of CSS selectors, see the W3Schools CSS Selector Reference. Some examples follow.

An HTML element (or tag) has the following form.

<tag_or_element attribute="value">text</tag_or_element>
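To see the three parts (tag, attribute, text) separately, the standard library's html.parser module can report each one as it reads the markup. This is only an illustrative sketch; the class name PartCollector is hypothetical, and the rest of this document uses BeautifulSoup instead.

```python
from html.parser import HTMLParser


class PartCollector(HTMLParser):
    """Record the tags, attributes, and text seen while parsing (hypothetical class)."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        self.events.append(("start", tag, dict(attrs)))

    def handle_data(self, data):
        self.events.append(("text", data))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


collector = PartCollector()
collector.feed('<p id="my-Address">I live in Duckburg</p>')
collector.close()
for event in collector.events:
    print(event)
```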

The following is part of an HTML example.

<div class="intro"> <!-- div is the element, class is the attribute, and "intro" is the value of the class attribute -->
<p>My name is Donald <span id="Lastname">Duck.</span></p>

<p id="my-Address">I live in Duckburg</p>

<p>I have many friends:</p>
</div>

Selector	Example	Description	CSS version
`.class`	`.intro`	Selects all elements with `class="intro"`	1
`#id`	`#firstname`	Selects all elements with `id="firstname"`	1
`*`	`*`	Selects all elements	2
`element *`	`div *`	Selects all elements inside (descendants of) `<div>`. Nested elements are each returned, so inner content can appear more than once in the output.	2
`element`	`p`	Selects all `<p>` elements	1
`element, element`	`div, p`	Selects all `<div>` elements and all `<p>` elements. Each element is returned only once.	1
`element element`	`div p`	Selects all `<p>` elements inside `<div>` elements (descendants)	1
`element > element`	`div > p`	Selects all `<p>` elements whose parent is a `<div>`	2
`element + element`	`div + p`	Selects the `<p>` element that is a sibling of `<div>` and comes immediately after it	2
`element1 ~ element2`	`p ~ ul`	Selects all `<ul>` elements that are siblings of `<p>` and come after it	3
`[attribute]`	`[target]`	Selects all elements that have a `target` attribute	2
`[attribute=value]`	`[target=_blank]`	Selects all elements with `target="_blank"`	2
`[attribute~=value]`	`[title~=flower]`	Selects all elements whose `title` attribute contains the word `flower`	2
`[attribute|=value]`	`[lang|=en]`	Selects all elements whose `lang` attribute is `en` or starts with `en-`	2
`:nth-of-type(n)`	`p:nth-of-type(2)`	Selects every `<p>` that is the second `<p>` child of its parent	3

import bs4

html = "<html><head><title>Title</title></head><body>...omitted...</body></html>"
bs = bs4.BeautifulSoup(html, 'html.parser')

import bs4

html = """
<html>

<head>
</head>

<body>
<h1>Welcome to My Homepage</h1>
<div class="intro">
<p>My name is Donald <span id="Lastname">Duck.</span></p>

<p id="my-Address">I live in Duckburg</p>

<p>I have many friends:</p>
</div>

<ul id="Listfriends">
<li>Goofy</li>
<li>Mickey</li>
<li>Daisy</li>
<li>Pluto</li>
</ul>

<p>All my friends are great!<br>
But I really like Daisy!!</p>

<p lang="it" title="Hello beautiful">Ciao bella</p>

<h3>We are all animals!</h3>

<p><b>My latest discoveries have led me to believe that we are all animals:</b></p>

<table>
<thead>
<tr>
<th>Name</th>    <th>Type of Animal</th>
</tr>
</thead>
<tr>
<td>Mickey</td>    <td>Mouse</td>
</tr>
<tr>
<td>Goofey</td>    <td>Dog</td>
</tr>
<tr>
<td>Daisy</td>    <td>Duck</td>
</tr>
<tr>
<td>Pluto</td>    <td>Dog</td>
</tr>
</table>

</body>
</html>
"""

bs = bs4.BeautifulSoup(html, 'html.parser')
bs.select('table')

[<table>
 <thead>
 <tr>
 <th>Name</th> <th>Type of Animal</th>
 </tr>
 </thead>
 <tr>
 <td>Mickey</td> <td>Mouse</td>
 </tr>
 <tr>
 <td>Goofey</td> <td>Dog</td>
 </tr>
 <tr>
 <td>Daisy</td> <td>Duck</td>
 </tr>
 <tr>
 <td>Pluto</td> <td>Dog</td>
 </tr>
 </table>]

The following HTML document is used for the examples below.

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</body>
</html>
"""

Finding elements

soup = bs4.BeautifulSoup(html_doc, 'html.parser')
soup.select('title')

[<title>The Dormouse's story</title>]

Finds every element whose tag is title.

soup.select("p:nth-of-type(3)")

[<p class="story">...</p>]

Finds the p element that is the third p among the children of its parent.

Finding elements beneath an element

soup.select("body a")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Finds all a elements that are descendants of body.

soup.select("html head title")

[<title>The Dormouse's story</title>]

Finds all title elements that are descendants of a head element, which is itself a descendant of html.

Finding elements directly beneath an element

soup.select("head > title")

[<title>The Dormouse's story</title>]

Finds all title elements that are children of head.

soup.select("p > a")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Finds all a elements that are children of p.

soup.select("p > a:nth-of-type(2)")

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Among the a elements that are children of p, finds the second one.

soup.select("p > #link1")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Among the children of p, finds the element whose id is link1.

soup.select("body > a")

[]

Looks for a elements that are direct children of body; there are none, so the result is an empty list.

Finding elements at the same level

soup.select("#link1 ~ .sister")

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Among the siblings that follow the element with id link1, finds all elements whose class is sister.

soup.select("#link1 + .sister")

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Finds the element with class sister that comes immediately after the element with id link1.

Finding elements by CSS class

soup.select(".sister")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Finds all elements whose class is sister.

soup.select("[class~=sister]")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Finds all elements whose class attribute contains the word sister.

Finding elements by ID

soup.select("#link1")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select("a#link2")

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Finding elements by attribute

soup.select('a[href]')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Finding elements by attribute value

soup.select('a[href="http://example.com/elsie"]')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select('a[href^="http://example.com/"]')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select('a[href$="tillie"]')

[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select('a[href*=".com/el"]')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
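Each item returned by select() is a Tag object, so once the elements are found their text and attribute values can be read directly. The following is a small sketch, assuming BeautifulSoup is installed; the shortened HTML snippet is made up for this example.

```python
import bs4

html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""

soup = bs4.BeautifulSoup(html_doc, "html.parser")

# get_text() returns the text inside a tag; tag["href"] reads an attribute value.
links = [(a.get_text(), a["href"]) for a in soup.select("a.sister")]
print(links)

# select_one() returns the first match, or None when nothing matches.
first = soup.select_one("#link2")
print(first["id"], first.get_text())
```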

Extracting the headline news title from the Naver Finance site

import requests, bs4

resp = requests.get('http://finance.naver.com/')
resp.raise_for_status()

resp.encoding='euc-kr'
html = resp.text

bs = bs4.BeautifulSoup(html, 'html.parser')
print(bs.prettify()[0:100], "\n.\n.\n.\n", bs.prettify()[-100:])

tags = bs.select('div.news_area h2 a')  # headline news title
title = tags[0].getText()
print("Headline title: ", title)

<html lang="ko">
 <head>
  <title>
   네이버 금융
  </title>
  <meta content="text/html; charset=utf-8" h 
.
.
.
 , 이미지 리플레시 
jindo.$Fn(mainPageDomReadyFn).attach(document, "domready");
  </script>
 </body>
</html>
Headline title:  조정받는 코스닥, 반등하는 코스피.....

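Try it yourself

Extract the real-time trending search terms from the Daum site.
Print the real-time KOSPI index from the Naver site.
Print the yen cash buying and selling rates from the Naver exchange rate site. The data is embedded through an iframe, so you must enter the exact address of the framed page.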

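Selenium

Selenium is a module that automates a web browser. Instead of operating the browser by hand, you drive it through Selenium methods. Selenium consists of Selenium 2 (Selenium WebDriver), Selenium 1 (Selenium RC), Selenium IDE, and Selenium-Grid. This document uses Selenium 2 (Selenium WebDriver), which provides bindings for programming languages such as Java, C#, Python, and JavaScript, making it convenient to use from a program. To use Selenium 2 you must download the driver that matches your web browser. Here we use PhantomJS, a headless browser that runs without a visible browser window. PhantomJS development has since been suspended and Selenium 4 no longer supports it, so the PhantomJS examples require an older Selenium release. For the Python Selenium documentation, see http://selenium-python.readthedocs.io/index.html. For more detail, refer to the Selenium Python WebDriver API.

Installation

In an Anaconda Prompt window, type the following to install Selenium.

pip install selenium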

Downloading the driver

Download the PhantomJS driver from the Internet and place it in a drivers folder under the working directory. The download takes a little while.

import urllib.request
import os

directory = 'drivers'
if not os.path.exists(directory):
    os.makedirs(directory)

url = 'https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-windows.zip'

fpath = directory + '/' + 'phantomjs-2.1.1-windows.zip'

if not os.path.exists(fpath):
    urllib.request.urlretrieve(url, fpath)

Extracting the archive

Extract the downloaded file.

import zipfile
zip_ref = zipfile.ZipFile(fpath, 'r')
zip_ref.extractall(directory)
zip_ref.close()
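The same extraction can be written with a with statement so the archive is closed even if extraction fails. The sketch below builds a small throwaway archive in a temporary directory so that it runs without downloading anything; the file names are made up for illustration.

```python
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as workdir:
    # Create a tiny archive to stand in for the downloaded driver zip.
    archive_path = os.path.join(workdir, "sample.zip")
    with zipfile.ZipFile(archive_path, "w") as zf:
        zf.writestr("sample/bin/readme.txt", "hello")

    # Extract it; the with block closes the archive automatically.
    target_dir = os.path.join(workdir, "drivers")
    with zipfile.ZipFile(archive_path, "r") as zf:
        zf.extractall(target_dir)

    extracted = os.path.join(target_dir, "sample", "bin", "readme.txt")
    extracted_ok = os.path.exists(extracted)
    print(extracted_ok)
```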

Building the path to the extracted files

filename = os.path.split(url)[1]  # extract the file name
file_ext = os.path.splitext(filename)  # split into base name and extension

phantom_path = directory + '/' + file_ext[0] + '/bin/phantomjs.exe'
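os.path.join can assemble the same path without hand-written separators. This is a sketch under the same assumption as the code above, namely that the archive unpacks into a folder named after the zip file; the variable names stem and ext are introduced only for this example.

```python
import os

url = "https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-windows.zip"
directory = "drivers"

# os.path.basename keeps only the last component of the URL path.
filename = os.path.basename(url)
# splitext splits at the last dot, so the dots in the version number stay in the stem.
stem, ext = os.path.splitext(filename)

# os.path.join inserts the separator appropriate for the current operating system.
phantom_path = os.path.join(directory, stem, "bin", "phantomjs.exe")
print(filename, ext)
print(phantom_path)
```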

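Basic usage

First, set up a driver. A driver can be thought of as the counterpart of a web browser, and the driver for each kind of browser must be installed. The PhantomJS driver was installed above. To connect to a driver, either pass its location explicitly or place it in a directory on the operating system's PATH. The example below uses the Chrome driver to control the Chrome browser. Download the latest Chrome driver from the ChromeDriver download site and extract it into the drivers directory, which produces the file chromedriver.exe. It can then be used as follows.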

import os

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

chrome_path = 'drivers/chromedriver.exe'

assert os.path.exists(chrome_path)
driver = webdriver.Chrome(chrome_path)
driver.get("http://www.python.org")
assert "Python" in driver.title
elem = driver.find_element_by_name("q")
elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)
assert "No results found." not in driver.page_source
driver.close()

Running this opens the Chrome browser in a new window, connects to the Python home page, searches for pycon, and then closes automatically.

Set the location of the Chrome driver.

chrome_path = 'drivers/chromedriver.exe'

If the given path exists, the assertion passes; otherwise it raises an AssertionError and the program stops.

assert os.path.exists(chrome_path)

Create a WebDriver instance using the Chrome driver.

driver = webdriver.Chrome(chrome_path)

Connect to the site with the get() method.

driver.get("http://www.python.org")

Check that the title of the loaded page contains Python.

assert "Python" in driver.title

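WebDriver can locate elements with the find_element_by_* methods. An input element that has a name attribute can be found with find_element_by_name. The line below finds the element whose name attribute has the value q. (Selenium 4 replaces these helpers with find_element(By.NAME, "q").)

elem = driver.find_element_by_name("q")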

clear() removes any text already in the field. send_keys() types the given text into the input field. Keys.RETURN is equivalent to pressing the Enter key.

elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)

close() closes the current window or tab. quit() closes every window and ends the WebDriver session.

driver.close()

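Try it yourself

Download the Chrome driver, extract it, and run the program above.

Retrieving data from a site

Let us use Selenium to fetch the average prices of agricultural and livestock products for October 2017 from the local price information site of the Ministry of the Interior and Safety. If you view the page source, the table that was visible on the web page does not appear. This is because the part that displays the table is handled by an iframe. An iframe (inline frame) displays a web page located elsewhere at the current position. It is therefore necessary to switch to the iframe.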

from selenium import webdriver
from selenium.webdriver.support.ui import Select
import bs4
import pandas as pd

# set up the PhantomJS driver
driver = webdriver.PhantomJS(executable_path=phantom_path)

# fetch the web document from the site
site = 'http://www.mois.go.kr/frt/sub/a02/farmProductPriceList/screen.do'
driver.get(site)

# switch to the iframe
iframe = driver.find_element_by_css_selector('iframe')
driver.switch_to.frame(iframe)

# select October 2017
elem = driver.find_element_by_id('year')
select = Select(elem)
select.select_by_value("2017")
elem = driver.find_element_by_id('month')
select = Select(elem)
select.select_by_value("10")

driver.find_element_by_id('srch').click()
html = driver.page_source

driver.close()

bs = bs4.BeautifulSoup(html, 'html.parser')
tables = bs.select('div > table > tbody')
rows = tables[0].find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    print(cols)

['서울', '9,864', '2,619', '6,353', '2,281', '3,330', '1,769', '3,622', '3,329', '11,153', '48,638']
['부산', '9,382', '2,370', '5,962', '3,497', '5,046', '1,961', '2,929', '3,329', '9,601', '43,952']
['대구', '9,428', '2,455', '6,423', '2,447', '4,345', '1,691', '3,428', '3,071', '9,373', '41,517']
['인천', '8,691', '2,438', '5,652', '2,179', '3,797', '1,622', '3,761', '2,783', '9,317', '42,322']
['광주', '10,433', '2,447', '5,219', '2,346', '4,240', '2,061', '3,730', '3,382', '9,868', '43,047']
['대전', '8,013', '2,599', '5,091', '3,169', '2,905', '1,589', '3,876', '4,131', '9,329', '42,780']
['울산', '9,160', '2,196', '6,485', '2,587', '4,608', '1,935', '3,581', '3,957', '10,523', '42,880']
['경기', '9,627', '2,389', '5,917', '2,542', '4,263', '1,860', '3,713', '3,304', '9,848', '44,653']
['강원', '9,251', '2,410', '5,627', '2,208', '4,249', '1,871', '3,186', '3,445', '8,535', '45,319']
['충북', '9,851', '2,643', '5,550', '2,908', '4,044', '2,019', '3,132', '3,279', '9,297', '42,971']
['충남', '8,258', '2,146', '5,626', '2,400', '4,161', '1,914', '3,008', '3,784', '8,169', '44,093']
['전북', '8,892', '2,196', '5,353', '2,954', '3,270', '1,857', '3,557', '3,509', '8,580', '42,440']
['전남', '8,716', '2,098', '5,391', '3,072', '4,268', '2,188', '3,217', '3,210', '7,373', '40,663']
['경북', '7,675', '2,136', '5,941', '2,953', '4,611', '2,032', '3,006', '2,588', '9,240', '41,505']
['경남', '8,795', '2,165', '5,327', '2,960', '3,904', '1,917', '3,189', '3,027', '8,056', '41,187']
['제주', '8,425', '2,528', '6,298', '2,860', '5,522', '2,070', '2,910', '3,087', '8,152', '43,017']
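The scraped cells are strings with thousands separators, so they must be converted before any arithmetic. The following is a minimal sketch with a hypothetical helper to_int, applied to the first row of the output.

```python
def to_int(text: str) -> int:
    """Convert a string such as '9,864' to the integer 9864 (hypothetical helper)."""
    return int(text.replace(",", ""))


row = ['서울', '9,864', '2,619', '6,353', '2,281', '3,330',
       '1,769', '3,622', '3,329', '11,153', '48,638']

region = row[0]                              # the first cell is the region name
prices = [to_int(cell) for cell in row[1:]]  # the remaining cells are prices
print(region, prices)
```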

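Try it yourself

Print the average charges for January 2016 from the local public utility charges page of the Ministry of the Interior and Safety local price information site.
Convert the average charges to numbers.
Print the shuttle bus timetable from the Korea University Sejong Campus home page.
Log in to Naver and print the subjects of your emails.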

Analysis

pandas

The pandas module provides convenient methods for working with data.

Creating a DataFrame

Reading Excel data

import pandas as pd

df_excel = pd.read_excel('http://qrc.depaul.edu/Excel_Files/Presidents.xls'); df_excel

	President	Years in office	Year first inaugurated	Age at inauguration	State elected from	# of electoral votes	# of popular votes	National total votes	Total electoral votes	Rating points	Political Party	Occupation	College	% electoral	% popular
0	George Washington	8	1789	57	Virginia	69	NA()	NA()	69	842.0	None	Planter	None	100.000000	NA()
1	John Adams	4	1797	61	Massachusetts	132	NA()	NA()	139	598.0	Federalist	Lawyer	Harvard	94.964029	NA()
2	Thomas Jefferson	8	1801	57	Virginia	73	NA()	NA()	137	711.0	Democratic-Republican	Planter, Lawyer	William and Mary	53.284672	NA()
3	James Madison	8	1809	57	Virginia	122	NA()	NA()	176	567.0	Democratic-Republican	Lawyer	Princeton	69.318182	NA()
4	James Monroe	8	1817	58	Virginia	183	NA()	NA()	221	602.0	Democratic-Republican	Lawyer	William and Mary	82.805430	NA()
5	John Quincy Adams	4	1825	57	Massachusetts	84	NA()	NA()	261	564.0	Democratic-Republican	Lawyer	Harvard	32.183908	NA()
6	Andrew Jackson	8	1829	61	Tennessee	178	642553	1148018	261	632.0	Democrat	Lawyer	None	68.199234	55.9706
7	Martin Van Buren	4	1837	54	New York	170	764176	1503534	294	429.0	Democrat	Lawyer	None	57.823129	50.8253
8	William Henry Harrison	0.8	1841	68	Ohio	234	1275390	2411808	294	329.0	Whig	Soldier	Hampden-Sydney	79.591837	52.8811
9	James K. Polk	4	1845	49	Tennessee	170	1339494	2703659	275	632.0	Democrat	Lawyer	U. of North Carolina	61.818182	49.5437
10	Zachary Taylor	1	1849	64	Louisiana	163	1361393	2879184	290	447.0	Whig	Soldier	None	56.206897	47.284
11	Franklin Pierce	4	1853	48	New Hampshire	254	1607510	3161830	296	286.0	Democrat	Lawyer	Bowdoin	85.810811	50.8411
12	James Buchanan	4	1857	65	Pennsylvania	174	1836072	4054647	296	259.0	Democrat	Lawyer	Dickinson	58.783784	45.2832
13	Abraham Lincoln	4	1861	52	Illinois	180	1865908	4685561	303	900.0	Republican	Lawyer	None	59.405941	39.8225
14	Ulysses S. Grant	8	1869	46	Illinois	214	3013650	5722440	294	403.0	Republican	Soldier	US Military Academy	72.789116	52.6637
15	Rutherford B. Hayes	4	1877	54	Ohio	185	4034311	8413101	369	477.0	Republican	Lawyer	Kenyon	50.135501	47.9527
16	James A. Garfield	0.5	1881	49	Ohio	214	4446158	9210420	369	444.0	Republican	Lawyer	Williams	57.994580	48.2731
17	Grover Cleveland	4	1885	47	New York	219	4874621	10049754	401	576.0	Democrat	Lawyer	None	54.613466	48.5049
18	Benjamin Harrison	4	1889	55	Indiana	233	5443892	11383320	401	426.0	Republican	Lawyer	Miami	58.104738	47.8234
19	Grover Cleveland	4	1893	55	New York	277	5551883	12056097	444	576.0	Democrat	Lawyer	None	62.387387	46.0504
20	William McKinley	4	1897	54	Ohio	271	7108480	13935738	447	601.0	Republican	Lawyer	Allegheny College	60.626398	51.009
21	William Howard Taft	4	1909	51	Ohio	321	7676258	14882734	483	491.0	Republican	Lawyer	Yale	66.459627	51.5783
22	Woodrow Wilson	8	1913	56	New Jersey	435	6293152	15040963	531	723.0	Democrat	Educator	Princeton	81.920904	41.8401
23	Warren G. Harding	2	1921	55	Ohio	404	16133314	26753786	531	326.0	Republican	Editor	None	76.082863	60.3029
24	Herbert Hoover	4	1929	54	California	444	21411991	36790364	531	400.0	Republican	Engineer	Stanford	83.615819	58.2
25	Franklin Roosevelt	12	1933	51	New York	472	22825016	39749382	531	876.0	Democrat	Lawyer	Harvard	88.888889	57.4223
26	Dwight D. Eisenhower	8	1953	62	New York	442	33936137	61551118	531	699.0	Republican	Soldier	US Military Academy	83.239171	55.1349
27	John F. Kennedy	3	1961	43	Massachusetts	303	34221344	68828960	537	704.0	Democrat	Author	Harvard	56.424581	49.7194
28	Richard M. Nixon	5	1969	56	New York	301	31785148	73203370	538	477.0	Republican	Lawyer	Whittier	55.947955	43.4203
29	Jimmy Carter	4	1977	52	Georgia	297	40830763	81555889	538	518.0	Democrat	Businessman	US Naval Academy	55.204461	50.0648
30	Ronald Reagan	8	1981	69	California	489	43904153	86515221	538	634.0	Republican	Actor	Eureka College	90.892193	50.7473
31	George Bush	4	1989	64	Texas	426	48886097	91584820	538	548.0	Republican	Businessman	Yale	79.182156	53.3779
32	Bill Clinton	8	1993	46	Arkansas	370	44909326	104425014	538	539.0	Democrat	Lawyer	Georgetown	68.773234	43.0063
33	George W. Bush	8	2001	54	Texas	271	50460110	105417258	538	NaN	Republican	Businessman	Yale	50.371747	47.867
34	Barack Obama	n/a	2009	47	Illinois	365	69492376	129438754	538	NaN	Democrat	Lawyer	Columbia University	67.843866	53.6875

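Reading a table from a web page (Ministry of the Interior and Safety local price information)

Let us fetch the previous month's average prices of agricultural and livestock products from the Ministry of the Interior and Safety local price information site, http://www.mois.go.kr/frt/sub/a02/mulMain/screen.do.

import bs4, requests
import pandas as pd

# Ministry of the Interior and Safety local price information: agricultural and livestock products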
site = 'http://www.mois.go.kr/frt/sub/a02/farmProductPriceList/screen.do'

resp = requests.get(site)
html = resp.text
bs = bs4.BeautifulSoup(html, 'html.parser')
iframes = bs.select('iframe')
for iframe in iframes:
    frame_site = iframe.attrs['src']
    dfs = pd.read_html(frame_site, na_values=['-'])
df = dfs[0]
print(df)

    구분    쇠고기  돼지고기   닭고기    달걀    배추     무    감자  고추가루      콩      쌀
0   서울   9864  2619  6353  2281  3330  1769  3622  3329  11153  48638
1   부산   9382  2370  5962  3497  5046  1961  2929  3329   9601  43952
2   대구   9428  2455  6423  2447  4345  1691  3428  3071   9373  41517
3   인천   8691  2438  5652  2179  3797  1622  3761  2783   9317  42322
4   광주  10433  2447  5219  2346  4240  2061  3730  3382   9868  43047
5   대전   8013  2599  5091  3169  2905  1589  3876  4131   9329  42780
6   울산   9160  2196  6485  2587  4608  1935  3581  3957  10523  42880
7   경기   9627  2389  5917  2542  4263  1860  3713  3304   9848  44653
8   강원   9251  2410  5627  2208  4249  1871  3186  3445   8535  45319
9   충북   9851  2643  5550  2908  4044  2019  3132  3279   9297  42971
10  충남   8258  2146  5626  2400  4161  1914  3008  3784   8169  44093
11  전북   8892  2196  5353  2954  3270  1857  3557  3509   8580  42440
12  전남   8716  2098  5391  3072  4268  2188  3217  3210   7373  40663
13  경북   7675  2136  5941  2953  4611  2032  3006  2588   9240  41505
14  경남   8795  2165  5327  2960  3904  1917  3189  3027   8056  41187
15  제주   8425  2528  6298  2860  5522  2070  2910  3087   8152  43017

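Note: pd.read_html(page) converts the tables in a web page into a list of pandas DataFrames. In the output above, the 구분 column holds the region name, and the remaining columns are products: 쇠고기 (beef), 돼지고기 (pork), 닭고기 (chicken), 달걀 (eggs), 배추 (napa cabbage), 무 (radish), 감자 (potatoes), 고추가루 (red pepper powder), 콩 (soybeans), and 쌀 (rice).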

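Try it yourself

From the national indicator system site http://www.index.go.kr/potal/main/EachDtlPageDetail.do?idx_cd=1007, print the table of population and population density by region.
Rename the columns of the retrieved data using df.columns, for example to 2012 population, 2012 population density, and so on.

Printing DataFrame contents

df['구분']  # returned as a Series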

0     서울
1     부산
2     대구
3     인천
4     광주
5     대전
6     울산
7     경기
8     강원
9     충북
10    충남
11    전북
12    전남
13    경북
14    경남
15    제주
Name: 구분, dtype: object

df[['구분', '쇠고기', '감자']]  # returned as a DataFrame

	구분	쇠고기	감자
0	서울	9864	3622
1	부산	9382	2929
2	대구	9428	3428
3	인천	8691	3761
4	광주	10433	3730
5	대전	8013	3876
6	울산	9160	3581
7	경기	9627	3713
8	강원	9251	3186
9	충북	9851	3132
10	충남	8258	3008
11	전북	8892	3557
12	전남	8716	3217
13	경북	7675	3006
14	경남	8795	3189
15	제주	8425	2910

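Try it yourself

Print the 2012 population and the 2017 population density separately.

Counting values

Count the number of presidents belonging to each political party.

df_excel['Political Party'].value_counts()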

Republican               14
Democrat                 13
Democratic-Republican     4
Whig                      2
None                      1
Federalist                1
Name: Political Party, dtype: int64

Basic statistics

Summary

The describe() method shows basic summary statistics. describe(include='all') shows statistics for every column, including non-numeric ones.

df_excel.describe()

	Year first inaugurated	Age at inauguration	# of electoral votes	Total electoral votes	Rating points	% electoral
count	35.000000	35.000000	35.000000	35.000000	33.000000	35.000000
mean	1892.542857	55.085714	261.114286	385.085714	552.606061	68.048420
std	64.758530	6.381828	118.620198	143.817567	159.117280	15.092928
min	1789.000000	43.000000	69.000000	69.000000	259.000000	32.183908
25%	1843.000000	51.000000	176.000000	292.000000	444.000000	57.123855
50%	1885.000000	55.000000	234.000000	401.000000	564.000000	66.459627
75%	1943.000000	57.500000	343.000000	531.000000	632.000000	80.756370
max	2009.000000	69.000000	489.000000	538.000000	900.000000	100.000000

df_excel.describe(include='all')

	President	Years in office	Year first inaugurated	Age at inauguration	State elected from	# of electoral votes	# of popular votes	National total votes	Total electoral votes	Rating points	Political Party	Occupation	College	% electoral	% popular
count	35	35.0	35.000000	35.000000	35	35.000000	35	35	35.000000	33.000000	35	35	35	35.000000	35
unique	34	10.0	NaN	NaN	15	NaN	30	30	NaN	NaN	6	10	20	NaN	30
top	Grover Cleveland	4.0	NaN	NaN	Ohio	NaN	NA()	NA()	NaN	NaN	Republican	Lawyer	None	NaN	NA()
freq	2	16.0	NaN	NaN	6	NaN	6	6	NaN	NaN	14	21	8	NaN	6
mean	NaN	NaN	1892.542857	55.085714	NaN	261.114286	NaN	NaN	385.085714	552.606061	NaN	NaN	NaN	68.048420	NaN
std	NaN	NaN	64.758530	6.381828	NaN	118.620198	NaN	NaN	143.817567	159.117280	NaN	NaN	NaN	15.092928	NaN
min	NaN	NaN	1789.000000	43.000000	NaN	69.000000	NaN	NaN	69.000000	259.000000	NaN	NaN	NaN	32.183908	NaN
25%	NaN	NaN	1843.000000	51.000000	NaN	176.000000	NaN	NaN	292.000000	444.000000	NaN	NaN	NaN	57.123855	NaN
50%	NaN	NaN	1885.000000	55.000000	NaN	234.000000	NaN	NaN	401.000000	564.000000	NaN	NaN	NaN	66.459627	NaN
75%	NaN	NaN	1943.000000	57.500000	NaN	343.000000	NaN	NaN	531.000000	632.000000	NaN	NaN	NaN	80.756370	NaN
max	NaN	NaN	2009.000000	69.000000	NaN	489.000000	NaN	NaN	538.000000	900.000000	NaN	NaN	NaN	100.000000	NaN

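Column means and sums

mean() computes the mean of each column. It applies only to numeric columns; in recent pandas versions, pass numeric_only=True when the frame also contains text columns. sum() computes the total of each column. For a string column, it concatenates the strings of all rows.

df.mean(numeric_only=True)  # returns a Series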

쇠고기      9028.8125
돼지고기     2364.6875
닭고기      5763.4375
달걀       2710.1875
배추       4160.1875
무        1897.2500
감자       3365.3125
고추가루     3325.9375
콩        9150.8750
쌀       43186.5000
dtype: float64

df.sum()

구분      서울부산대구인천광주대전울산경기강원충북충남전북전남경북경남제주
쇠고기                               144461
돼지고기                               37835
닭고기                                92215
달걀                                 43363
배추                                 66563
무                                  30356
감자                                 53845
고추가루                               53215
콩                                 146414
쌀                                 690984
dtype: object

df.iloc[row_slice, column_slice] lets you use Python slicing syntax on row and column positions.

df.iloc[:, 1:].sum()

쇠고기     144461
돼지고기     37835
닭고기      92215
달걀       43363
배추       66563
무        30356
감자       53845
고추가루     53215
콩       146414
쌀       690984
dtype: int64

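Slicing by label is also possible. df.loc[:, '쇠고기':'닭고기'] selects the columns from 쇠고기 (beef) through 닭고기 (chicken). Unlike iloc, loc includes the end label, so :5 in the example returns rows 0 through 5.

df.loc[:5, '쇠고기':'닭고기']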

	쇠고기	돼지고기	닭고기
0	9864	2619	6353
1	9382	2370	5962
2	9428	2455	6423
3	8691	2438	5652
4	10433	2447	5219
5	8013	2599	5091
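The difference between position-based and label-based slicing is easy to miss: iloc excludes the end position, while loc includes the end label. The following is a small sketch on a made-up three-column frame, assuming pandas is installed; the values are copied from the first rows of the price table and the variable names are assumptions.

```python
import pandas as pd

# Small stand-in for the price table: region, beef, pork.
df = pd.DataFrame({
    "구분": ["서울", "부산", "대구", "인천"],
    "쇠고기": [9864, 9382, 9428, 8691],
    "돼지고기": [2619, 2370, 2455, 2438],
})

# Positions 0 and 1 only (end excluded), every column except the first.
by_position = df.iloc[:2, 1:]
# Row labels 0 through 2 (end included), columns 쇠고기 through 돼지고기.
by_label = df.loc[:2, "쇠고기":"돼지고기"]

print(len(by_position), len(by_label))
print(by_label["쇠고기"].sum())
```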

Sorting

sort_values(by=['colname']) sorts the DataFrame by the given column.

df.sort_values(by=['쇠고기'])

	구분	쇠고기	돼지고기	닭고기	달걀	배추	무	감자	고추가루	콩	쌀
13	경북	7675	2136	5941	2953	4611	2032	3006	2588	9240	41505
5	대전	8013	2599	5091	3169	2905	1589	3876	4131	9329	42780
10	충남	8258	2146	5626	2400	4161	1914	3008	3784	8169	44093
15	제주	8425	2528	6298	2860	5522	2070	2910	3087	8152	43017
3	인천	8691	2438	5652	2179	3797	1622	3761	2783	9317	42322
12	전남	8716	2098	5391	3072	4268	2188	3217	3210	7373	40663
14	경남	8795	2165	5327	2960	3904	1917	3189	3027	8056	41187
11	전북	8892	2196	5353	2954	3270	1857	3557	3509	8580	42440
6	울산	9160	2196	6485	2587	4608	1935	3581	3957	10523	42880
8	강원	9251	2410	5627	2208	4249	1871	3186	3445	8535	45319
1	부산	9382	2370	5962	3497	5046	1961	2929	3329	9601	43952
2	대구	9428	2455	6423	2447	4345	1691	3428	3071	9373	41517
7	경기	9627	2389	5917	2542	4263	1860	3713	3304	9848	44653
9	충북	9851	2643	5550	2908	4044	2019	3132	3279	9297	42971
0	서울	9864	2619	6353	2281	3330	1769	3622	3329	11153	48638
4	광주	10433	2447	5219	2346	4240	2061	3730	3382	9868	43047

Visualization

get_ipython().run_line_magic('matplotlib', 'inline')
import matplotlib.pyplot as plt
import numpy as np

Line plot

import numpy as np
from matplotlib import font_manager, rc

font_name = font_manager.FontProperties(fname="c:/Windows/Fonts/malgun.ttf").get_name()
rc('font', family=font_name)

df.set_index('구분').plot(kind='line', xticks=np.arange(len(df['구분'])), rot=90)

<matplotlib.axes._subplots.AxesSubplot at 0x238e2d606d8>

xticks=np.arange(16) sets the positions where the x-axis ticks are drawn; len(df['구분']) is 16 for this table.

Box plot

get_ipython().run_line_magic('matplotlib', 'inline')
from matplotlib import font_manager, rc

font_name = font_manager.FontProperties(fname="c:/Windows/Fonts/malgun.ttf").get_name()
rc('font', family=font_name)

df.boxplot()

<matplotlib.axes._subplots.AxesSubplot at 0x238e2e63550>

Pie chart

get_ipython().run_line_magic('matplotlib', 'inline')
df_excel['Political Party'].value_counts().plot(kind="pie")

<matplotlib.axes._subplots.AxesSubplot at 0x238e30c83c8>

Bar chart

df_excel['Political Party'].value_counts().plot(kind="bar")

<matplotlib.axes._subplots.AxesSubplot at 0x238e319b4e0>

Reference sites

Beautiful Soup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#
requests: http://pythonstudy.xyz/python/article/403-%ED%8C%8C%EC%9D%B4%EC%8D%AC-Web-Scraping
urllib: https://www.acmicpc.net/blog/view/16
Python for Data Analysis by Wes McKinney: https://github.com/wesm/pydata-book
http://www.hanbit.co.kr/channel/category/category_view.html?cms_code=CMS9481416663
Pandas Documentation: http://pandas.pydata.org/pandas-docs/stable/
