wget

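This is less a tip than a question I want to raise.

wget has all sorts of uses on Linux.
It fetches things from other websites (and sometimes scrapes them...); I mostly use it to trigger PHP scripts on my own servers.

While browsing the options in man wget, I came across
-r
--recursive
and tried it out.
It scrapes a website's content wholesale, directory structure and all.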

To test it, run
wget http://www.phpschool.com -r

A folder named www.phpschool.com is created, with the site's subfolders laid out beneath it.
The directory structure is plainly visible:
banner community guild index.html survey
biznbaza company html_sub index.php title_image
class gnuboard4 images menu_images ttrend

Of course, what gets saved is the rendered HTML rather than the PHP source, but the CSS, JS, images and so on come across as they are.

Was I the only one who didn't know this?
Apologies if this is old news....

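One thing, though...
It is great for scraping someone else's site,
but having someone scrape mine would feel awful.

So~~
the more I think about it, the more I feel wget ought to be blocked.
The trouble is that it comes armed with
-U agent-string
--user-agent=agent-string
so blocking it by user agent looks impossible.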

Just in case, I checked, and robots.txt does work.

User-agent: *
Disallow: /

With that in place, wget could no longer scrape the site.
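As a minimal sketch of the setup above, the following writes that robots.txt into a document root. The path ./public_html is an assumption for illustration; use your web server's real document root. Keep in mind that robots.txt is advisory: wget honors it by default during recursive retrieval, but a user can switch that off with -e robots=off, so this deters polite crawlers rather than enforcing access control.

```shell
#!/bin/sh
# Assumption: DOCROOT is a hypothetical document root; adjust for your server.
DOCROOT="${DOCROOT:-./public_html}"
mkdir -p "$DOCROOT"

# Ask every crawler (User-agent: *) to stay out of the whole site (Disallow: /).
cat > "$DOCROOT/robots.txt" <<'EOF'
User-agent: *
Disallow: /
EOF

# Show what was written.
cat "$DOCROOT/robots.txt"
```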

I don't know whether there is a better way...
For your reference...

====================

I tested more wget options today.
What I had been most curious about was how the server sees the request
when you spoof the user agent with something like
--user-agent=agent-string
so I tried it against
http://browsers.garykeith.com/tools/your-browser.asp

wget --user-agent="Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)" http://browsers.garykeith.com/tools/your-browser.asp
and got bounced.
----------------------------------
Access Denied
You do not appear to be using this form in accordance with my Terms of Use.
Continued abuse will eventually result in you losing access to this server!
It's also possible you are using security software that modifies the HTTP_REFERER header.
If I can't confirm the referrer is valid then you can't have access to my forms.
--------------------------------

So this time I set the --referer option as well and tried again.
wget --user-agent="Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)" --referer="http://browsers.garykeith.com/tools/property-docs.asp" http://browsers.garykeith.com/tools/your-browser.asp
This time it worked!!!
-------------------------------
Your Browser
User Agent
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
.... (rest omitted)
--------------------------------

It is the obvious conclusion, perhaps, but both --user-agent and --referer work exactly as advertised.

Scarier still are --save-cookies, --load-cookies, --post-data=string and --post-file=file.

With the options
--save-cookies "cookie.txt" --post-data "user_id=myid&password=mypassword"
wget logged in neatly and saved the cookies.

Then, with
--load-cookies "cookie.txt"
it could reach any members-only page as normal.

I put the form fields a board post needs into a file called post.dat, and with
--load-cookies "cookie.txt" --post-file "post.dat"
it even posted to the board.
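The login-then-reuse flow above can be sketched as two small shell functions. This is only an illustration: LOGIN_URL, the cookie file name, and the example.com addresses are placeholders, and the form field names user_id and password are simply the ones from the example above, so a real site will need its own. --keep-session-cookies is added on the assumption that the site issues session cookies, which wget otherwise does not write to disk. Setting WGET=echo gives a dry run that only prints the arguments.

```shell
#!/bin/sh
# Assumptions: LOGIN_URL and COOKIE_JAR are hypothetical placeholders.
WGET="${WGET:-wget}"
COOKIE_JAR="${COOKIE_JAR:-cookie.txt}"
LOGIN_URL="${LOGIN_URL:-http://example.com/login.php}"

# Log in by POSTing the form fields and save the cookies the server sets.
# Usage: login USER_ID PASSWORD   (values must already be URL-encoded)
login() {
    "$WGET" --save-cookies "$COOKIE_JAR" --keep-session-cookies \
            --post-data "user_id=$1&password=$2" \
            -O /dev/null "$LOGIN_URL"
}

# Fetch a members-only page by replaying the saved cookies.
# Usage: fetch_member_page URL OUTPUT_FILE
fetch_member_page() {
    "$WGET" --load-cookies "$COOKIE_JAR" -O "$2" "$1"
}

# Example calls (commented out so that nothing touches the network here):
# login myid mypassword
# fetch_member_page http://example.com/members/index.php members.html
```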

Used well, this is great,
but one slip and you have a spam bot with no extra tools needed... sheesh~~

Sources

http://phpschool.com/gnuboard4/bbs/board.php?bo_table=tipntech&wr_id=62537&page=1

http://www.phpschool.com/gnuboard4/bbs/board.php?bo_table=tipntech&wr_id=62631&page=1

[Source] [Repost] Scraping the web with wget | Author: 사랑굳

http://www.thegeekstuff.com/2009/09/the-ultimate-wget-download-guide-with-15-awesome-examples/

The wget utility is one of the best options for downloading files from the internet. It can handle most complex download situations, including large file downloads, recursive downloads, non-interactive downloads and multiple file downloads.

In this article, let us review how to use wget for various download scenarios using 15 awesome wget examples.

1. Download Single File with wget

The following example downloads a single file from the internet and stores it in the current directory.

$ wget http://www.openss7.org/repos/tarballs/strx25-0.9.2.1.tar.bz2

While downloading, wget shows a progress bar with the following information:

Percentage of the download completed (e.g. 31% as shown below)
Total number of bytes downloaded so far (e.g. 1,213,592 bytes as shown below)
Current download speed (e.g. 68.2K/s as shown below)
Estimated time remaining (e.g. eta 34 seconds as shown below)

Download in progress:

$ wget http://www.openss7.org/repos/tarballs/strx25-0.9.2.1.tar.bz2
Saving to: `strx25-0.9.2.1.tar.bz2.1'

31% [=================> 1,213,592   68.2K/s  eta 34s

Download completed:
$ wget http://www.openss7.org/repos/tarballs/strx25-0.9.2.1.tar.bz2
Saving to: `strx25-0.9.2.1.tar.bz2'

100%[======================>] 3,852,374 76.8K/s in 55s

2009-09-25 11:15:30 (68.7 KB/s) - `strx25-0.9.2.1.tar.bz2' saved [3852374/3852374]

2. Download and Store With a Different File name Using wget -O

By default, wget names the file after whatever follows the last forward slash in the URL, which is not always appropriate.

Wrong: The following example downloads the file and stores it under the name download_script.php?src_id=7701

$ wget "http://www.vim.org/scripts/download_script.php?src_id=7701"

Even though the downloaded file is in zip format, it is stored under the name shown below.

$ ls
download_script.php?src_id=7701

Correct: To fix this, specify the output file name with the -O option. Quoting the URL keeps the shell from interpreting the ? character.

$ wget -O taglist.zip "http://www.vim.org/scripts/download_script.php?src_id=7701"

3. Specify Download Speed / Download Rate Using wget --limit-rate

By default, wget tries to use all the available bandwidth. This might not be acceptable when you are downloading huge files on production servers. To avoid that, limit the download speed with --limit-rate as shown below.

In the following example, the download speed is limited to 200 KB/s.

$ wget --limit-rate=200k http://www.openss7.org/repos/tarballs/strx25-0.9.2.1.tar.bz2

4. Continue the Incomplete Download Using wget -c

Resume a download that was interrupted partway through by using the -c option as shown below.

$ wget -c http://www.openss7.org/repos/tarballs/strx25-0.9.2.1.tar.bz2

This is very helpful when a very large download is interrupted partway through. Instead of starting the whole download again, you can resume from where it stopped with the -c option.

Note: If a download is interrupted and you restart it without -c, wget automatically appends .1 to the filename because a file with the original name already exists. If a file ending in .1 also exists, wget saves the download with .2 at the end, and so on.

5. Download in the Background Using wget -b

For a huge download, put wget in the background with the -b option as shown below.

$ wget -b http://www.openss7.org/repos/tarballs/strx25-0.9.2.1.tar.bz2
Continuing in background, pid 1984.
Output will be written to `wget-log'.

wget starts the download and returns the shell prompt to you. You can check the status of the download at any time with tail -f as shown below.

$ tail -f wget-log
Saving to: `strx25-0.9.2.1.tar.bz2.4'

     0K .......... .......... .......... .......... ..........  1% 65.5K 57s
    50K .......... .......... .......... .......... ..........  2% 85.9K 49s
   100K .......... .......... .......... .......... ..........  3% 83.3K 47s
   150K .......... .......... .......... .......... ..........  5% 86.6K 45s
   200K .......... .......... .......... .......... ..........  6% 33.9K 56s
   250K .......... .......... .......... .......... ..........  7%  182M 46s
   300K .......... .......... .......... .......... ..........  9% 57.9K 47s


6. Mask User Agent and Display wget like Browser Using wget --user-agent

Some websites refuse to serve a page when they detect that the user agent is not a browser. You can mask the user agent with the --user-agent option so that wget presents itself as a browser, as shown below.

$ wget --user-agent="Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.3) Gecko/2008092416 Firefox/3.0.3" URL-TO-DOWNLOAD

7. Test Download URL Using wget --spider

Before relying on a scheduled download, check that it will work at the scheduled time. To do so, copy the command line exactly from the schedule and add the --spider option.

$ wget --spider DOWNLOAD-URL

If the URL is correct, wget reports:

$ wget --spider download-url
Spider mode enabled. Check if remote file exists.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Remote file exists and could contain further links,
but recursion is disabled -- not retrieving.

This confirms that the download should succeed at the scheduled time. If you give a wrong URL, you get the following error instead.

$ wget --spider download-url
Spider mode enabled. Check if remote file exists.
HTTP request sent, awaiting response... 404 Not Found
Remote file does not exist -- broken link!!!

You can use the spider option in the following scenarios:

Checking a download before scheduling it.
Monitoring at regular intervals whether a website is available.
Checking a list of bookmarked pages to find out which ones still exist.
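The last scenario, checking a list of bookmarked pages, might look like the sketch below. It relies only on wget's exit status (zero on success, non-zero on failure). The file name bookmarks.txt and the OK/BROKEN labels are made up for illustration.

```shell
#!/bin/sh
WGET="${WGET:-wget}"

# Print "OK <url>" if the URL is reachable, "BROKEN <url>" otherwise.
# --spider checks the URL without saving it; -q suppresses wget's own output.
check_url() {
    if "$WGET" --spider -q "$1"; then
        echo "OK $1"
    else
        echo "BROKEN $1"
    fi
}

# Check every URL listed one per line in a file, skipping blank lines.
# Usage: check_list bookmarks.txt   (hypothetical file name)
check_list() {
    while IFS= read -r url; do
        if [ -n "$url" ]; then
            check_url "$url"
        fi
    done < "$1"
}
```

Calling check_list bookmarks.txt prints one OK or BROKEN line per URL, which is easy to filter with grep.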

8. Increase Total Number of Retry Attempts Using wget --tries

If the internet connection is unreliable and the file is large, the download may fail partway. By default, wget makes up to 20 attempts to complete the download.

If needed, you can increase the number of attempts with the --tries option as shown below.

$ wget --tries=75 DOWNLOAD-URL

9. Download Multiple Files / URLs Using wget -i

First, store all the URLs to download in a text file:

$ cat > download-file-list.txt
URL1
URL2
URL3
URL4

Next, pass download-file-list.txt to wget with the -i option as shown below.

$ wget -i download-file-list.txt

10. Download a Full Website Using wget --mirror

The following command line downloads a full website and makes it available for local viewing.

$ wget --mirror -p --convert-links -P ./LOCAL-DIR WEBSITE-URL

--mirror : turn on options suitable for mirroring.
-p : download all files that are necessary to properly display a given HTML page.
--convert-links : after the download, convert the links in the documents for local viewing.
-P ./LOCAL-DIR : save all the files and directories to the specified directory.
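For scripts, the same command can be wrapped in a small function with the long option names spelled out (-p is --page-requisites and -P is --directory-prefix). This is just a sketch: the function name mirror_site and the default directory are illustrative. For reference, the wget manual describes --mirror as currently equivalent to -r -N -l inf --no-remove-listing.

```shell
#!/bin/sh
WGET="${WGET:-wget}"

# Mirror a site into a local directory for offline viewing.
# Usage: mirror_site WEBSITE-URL [LOCAL-DIR]
mirror_site() {
    url=$1
    dest=${2:-./LOCAL-DIR}   # placeholder default, same as in the example above
    "$WGET" --mirror --page-requisites --convert-links \
            --directory-prefix="$dest" "$url"
}

# Example call (commented out so that nothing touches the network here):
# mirror_site http://example.com/ ./example-mirror
```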

11. Reject Certain File Types while Downloading Using wget --reject

If you have found a useful website but don't want to download its images, you can reject them by file suffix. The following skips GIF files during a recursive download.

$ wget -r --reject=gif WEBSITE-TO-BE-DOWNLOADED

12. Log messages to a log file instead of stderr Using wget -o

Use this when you want the log messages written to a log file instead of the terminal.

$ wget -o download.log DOWNLOAD-URL

13. Quit Downloading When it Exceeds Certain Size Using wget -Q

To stop downloading once the total exceeds 5 MB, use the following wget command line.

$ wget -Q5m -i FILE-WHICH-HAS-URLS

Note: The quota has no effect when you download a single URL; a single file is always downloaded in full regardless of the quota. The quota applies only to recursive downloads and to downloads from an input file.

14. Download Only Certain File Types Using wget -r -A

You can use this in the following situations:

Download all images from a website
Download all videos from a website
Download all PDF files from a website

$ wget -r -A.pdf http://url-to-webpage-with-pdfs/

15. FTP Download With wget

You can use wget to perform an FTP download as shown below.

Anonymous FTP download using wget

$ wget ftp-url

FTP download using wget with username and password authentication.

$ wget --ftp-user=USERNAME --ftp-password=PASSWORD DOWNLOAD-URL


1. Introduction

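Do you ever find yourself consulting a website so often that you wish you had a copy of it on your own computer? That is what 'wget', introduced here, is for. It does much the same job as 'Teleport Pro' on MS Windows. wget is used from the command line in a terminal such as hanterm.

The program is available from http://www.gnu.org/software/wget/wget.html.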

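2. Installation (usually installed by default on Linux)

# tar xvzf wget-5.3.1.tar.gz
# cd wget-5.3.1
# ./configure
# make install

This builds the 'wget' executable in the wget-5.3.1/src directory and installs it. If the install location is not on your PATH, create a symbolic link in a directory that is, or put a wrapper shell script there, so that you can run wget from any directory.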

3. Usage

# wget -h    or    # man wget

shows a fuller description. As a first example, to copy the home page of www.ihelpers.co.kr, run

# wget http://www.ihelpers.co.kr/index.html

The 'http://' can be omitted, and so can 'index.html': the server then returns its default page (typically 'index.html' or 'index.htm'), so

# wget www.ihelpers.co.kr/

works too. This saves 'index.html' in the directory where you ran the command. Now, how do you also fetch everything that index.html links to?

Use the -r option:

# wget -r www.ihelpers.co.kr/

The 'r' is short for 'recursive'. wget copies the directory structure and files as they are.

The depth of the recursive traversal defaults to 5 and can be changed with the '-l depth' option; in other words, the default is '-l 5'. As the level increases, the number of files copied grows exponentially.
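To get a feel for that growth, assume (purely for illustration) that every page links to the same number of new pages. The function below adds up the pages at each level to give an upper bound. Real sites share many links between pages and wget does not fetch the same URL twice, so actual counts are lower; the function name max_pages and the branching factor of 10 are made up for this sketch.

```shell
#!/bin/sh
# Upper bound on pages fetched when every page links to `branch` new pages
# and wget follows links `depth` levels deep: 1 + b + b^2 + ... + b^depth.
# Usage: max_pages BRANCH DEPTH
max_pages() {
    branch=$1
    depth=$2
    total=1    # the start page itself
    level=1    # number of pages at the current level
    i=0
    while [ "$i" -lt "$depth" ]; do
        level=$((level * branch))
        total=$((total + level))
        i=$((i + 1))
    done
    echo "$total"
}

max_pages 10 2   # prints 111
max_pages 10 5   # prints 111111
```

Going from '-l 2' to the default '-l 5' multiplies the bound by roughly a thousand in this example, which is why it is worth setting the depth deliberately.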

What if you want only a particular page and what it links to, rather than the whole site? For example,

# wget -r www.ihelpers.co.kr/doc/lecture/lecture.html

copies 'lecture.html' and the files linked from it. Some of those linked files live in parent directories; to leave them out and fetch only from the directory containing the page and below, use

# wget -r -np www.ihelpers.co.kr/doc/lecture/lecture.html

The '-np' option stands for 'no-parent'.

To copy only the HTML documents from a site, leaving out images, audio files and so on, use the '-A' option:

# wget -r -np -A html,htm www.ihelpers.co.kr/

'A' stands for 'accept'. List the file types you want, separated by commas with no spaces, as in the example. Conversely, to exclude certain files, use the '-R' option, where 'R' stands for 'reject'. For example,

# wget -r -R gif,jpg,jpeg www.ihelpers.co.kr/

fetches everything except files whose names end in 'gif', 'jpg' or 'jpeg'.

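With the '-L' option, wget follows only links written as relative URLs, so material on other hosts is naturally not fetched. Since most links within a site's HTML are probably relative anyway, this is unlikely to matter much. 'L' stands for 'relative'. It is used like this:

# wget -r -L www.ihelpers.co.kr/

To include material on other hosts in the recursive traversal, use the '-H' option, which stands for 'span-hosts'.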

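wget prints lengthy messages as it runs. To suppress them entirely, use '-q' ('quiet'); to print only a little, use '-nv' ('no-verbose').

When wget fetches a file and a file with the same name already exists locally, it leaves the existing file alone and saves the new copy with a number appended: 'original.file.1', 'original.file.2', and so on. To skip the download when a file of the same name exists, use '-nc' ('no-clobber').

Typing the same options on the command line every time is tedious. Create a file named '.wgetrc' in your home directory and record the options you need there, and you no longer have to give them each time.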

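For example (lines beginning with '#' are comments or disabled settings):

# File types to accept
accept = htm,html
# File types to reject
#reject = gif,jpg,jpeg
# Whether to retrieve recursively
recursive = on
# Depth of the recursive traversal
#reclevel = 5
# Whether to exclude files in parent directories
no_parent = on
# Whether to follow relative links only
#relative_only = on
# Whether to print detailed messages (on/off)
#verbose = on
# Whether to traverse other hosts as well (on/off)
span_hosts = off
# Whether to skip files that already exist locally (on/off)
#noclobber = on
# Whether to suppress all messages (on/off)
#quiet = on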
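If you would rather try such settings without touching your real ~/.wgetrc, one option is to write them to a scratch file and point the WGETRC environment variable at it, which wget reads in place of ~/.wgetrc. The file name ./wgetrc.test is made up for this sketch, and only three of the settings are used.

```shell
#!/bin/sh
# Assumption: ./wgetrc.test is a hypothetical scratch file, not your real ~/.wgetrc.
RC_FILE="${RC_FILE:-./wgetrc.test}"

# Fetch only HTML pages, recursively, without climbing to parent directories.
cat > "$RC_FILE" <<'EOF'
accept = htm,html
recursive = on
no_parent = on
EOF

# wget reads the file named by WGETRC instead of ~/.wgetrc.
export WGETRC="$RC_FILE"

# With this in effect, a plain call such as
#   wget www.ihelpers.co.kr/doc/lecture/lecture.html
# should behave like: wget -r -np -A htm,html www.ihelpers.co.kr/doc/lecture/lecture.html
```

Run unset WGETRC to go back to your normal configuration.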

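4. Discussion

Used well, the options described above let you pick up newly updated content from a website without opening a web browser. Suppose, for example, that I want to watch the videos posted at the URLs linked from a free porn site I frequent. I would use wget like this:

# wget -A mpg,mpeg,avi,asf -r -H -l 2 -nd -t 1 http://url.you.like

-A, -r and -H were explained above; the remaining options are:
-l 2 : Recursion depth. With a depth of 2, wget follows the URLs on the main page and then fetches the video files linked from those pages.
-nd : No directories. Do not create directories locally; put all downloaded files in the same directory.
-t 1 : Tries. Attempt each download only once instead of retrying.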

Source - http://blog.naver.com/soonhg/40024507911

Source: Tong - 라제폰's System / Server Setup Tong

License: Attribution, NonCommercial, NoDerivatives
