Date: 2011-09-05 05:38
HTML HTTP / How to Prevent Your Site or Certain Subdirectories From Being Crawled
Author: 07
Attachment: robots.txt (112 bytes)
Source: http://help.yahoo.com/l/us/yahoo/search/webcrawler/slurp-02.html
How to Prevent Your Site or Certain Subdirectories From Being Crawled
Last Updated: March 23, 2011


The crawler obeys the Robots Exclusion Standard; specifically, it adheres to the 1996 Robots Exclusion Standard (RES).
The crawler obeys the first entry in the robots.txt file whose User-agent applies to it.
Disallowed documents, including "/" (the home page of the site), are not crawled, and links in those documents are not followed. The crawler does read the home page at each site and uses it internally, but if the home page is disallowed, it is neither indexed nor followed. If robots.txt disallows a page, the crawler will not read or use that page's contents.
Example robots.txt:
User-agent: *
Disallow: /cgi-bin/
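
Because directives are treated as URL prefixes (see the notes on matching below), disallowing "/" matches every URL and keeps the crawler off the entire site:
User-agent: *
Disallow: /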

Directives are Case Sensitive
Robots directives for Disallow/Allow are case sensitive. Use the correct capitalization to match your website:
Example of capitalization:
User-agent: *
Disallow: /private
Disallow: /Private
Disallow: /PRIVATE

Additional Symbols
Additional symbols allowed in the robots.txt directives include:
'*' - matches a sequence of characters
'$' - anchors at the end of the URL string
Using Wildcard Match: '*'
A '*' in robots directives is used to wildcard match a sequence of characters in your URL. You can use this symbol in any part of the URL string that you provide in the robots directive.
Example of '*':
User-agent: *
Allow: /public*/
Disallow: /*_print*.html
Disallow: /*?sessionid

The robots directives above:
Allow all directories that begin with "public" to be crawled.
Example: /public_html/ or /public_graphs/
Disallow files or directories that contain "_print" from being crawled.
Example: /card_print.html or /store_print/product.html
Disallow files with "?sessionid" in their URL string from being crawled.
Example: /cart.php?sessionid=342bca31

Note: A trailing '*' is not needed, since the crawler already treats directives as prefixes.
In the example below, both 'Disallow' directives are equivalent:
User-agent: *
Disallow: /private*
Disallow: /private

Using '$'
A '$' in robots directives is used to anchor the match to the end of the URL string. Without this symbol, the crawler would match all URLs against the directives, treating the directives as a prefix.
Example of '$':
User-agent: *
Disallow: /*.gif$
Allow: /*?$

The robots directives above:
Disallow all files ending in '.gif' in your entire site.
Note: Omitting the '$' would disallow all files containing '.gif' in their file path.
Allow all files whose URLs end in '?' to be crawled. This does not apply to files that merely contain '?' somewhere in the URL string.

Note: The '$' symbol only makes sense at the end of the string. Hence, when the crawler encounters a '$' symbol, it assumes the directive terminates there and any characters after that symbol are ignored.
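For example, because everything after the '$' is ignored, the two directives below would be treated identically:
User-agent: *
Disallow: /*.gif$
Disallow: /*.gif$ignored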
Using Allow:
The 'Allow' tag is supported as shown in the examples above.
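A common use of 'Allow' is to open up a single file inside an otherwise disallowed directory. A minimal sketch (the paths here are placeholders, and how a crawler resolves conflicting 'Allow' and 'Disallow' rules can vary, so verify against the documentation of the crawler you care about):
User-agent: *
Disallow: /private/
Allow: /private/readable.html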
For additional details, see the Yahoo! Search help page linked at the top of this post.


Web crawlers run by search-engine companies scrape documents from all over the web and store them in a suitable form. They keep meta information such as the keywords, description, document length, URL, and title separately, but they also scrape and store the entire (X)HTML document itself.

The problem is that so many of these web crawlers roam the web that even documents you do not want exposed can end up public, and even stored. The "robots" meta tag is a way to address this:

<meta name="robots" content="index,follow" /> : index this document and follow its links.
<meta name="robots" content="noindex,follow" /> : do not index this document, but follow its links.
<meta name="robots" content="index,nofollow" /> : index this document, but ignore its links.
<meta name="robots" content="noindex,nofollow" /> : do not index this document and ignore its links.


sample
-------------------

User-agent: *
Disallow: /fig/
Disallow: /en/
Disallow: /gnuboard/
Disallow: /ko/
Allow: /mtu/
Allow: /
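
To sanity-check rules like these, Python's built-in robots.txt parser can be used. A minimal sketch using the sample above; note that urllib.robotparser implements the original prefix-matching standard and does not understand the '*' and '$' extensions described earlier:

from urllib.robotparser import RobotFileParser

# Parse the sample rules from a string (no network fetch needed)
rules = """\
User-agent: *
Disallow: /fig/
Disallow: /en/
Disallow: /gnuboard/
Disallow: /ko/
Allow: /mtu/
Allow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())
rp.modified()  # mark the rules as loaded; can_fetch() refuses everything otherwise

print(rp.can_fetch("*", "/gnuboard/board.php"))  # False: under a disallowed directory
print(rp.can_fetch("*", "/mtu/index.html"))      # True: explicitly allowed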

 
 
