In a recent bout of stupidity, the U.S. Department of Energy apparently
published, by accident, confidential Homeland Security Department
documents marked "For Official Use Only," and the documents remain
visible via Google's Web cache.
To avoid situations like this, be sure you've created a properly
configured robots.txt file on your Web servers. While it won't prevent
confidential documents from being placed on a publicly available
server, it is at least one way to prevent such documents from being
available in Google's Web cache from now until eternity.
The robots.txt file isn't based on any officially recognized standard, but it has been in existence since 1993 and is generally accepted.
The robots.txt file is placed on a Web server to provide instructions to well-behaved Web crawlers, or spiders. Anyone can run a crawler, but they're most often used by search engines to collect information about Web sites. The file specifies which directories or files the crawler should not index. There are basically two kinds of lines, "User-Agent:" and "Disallow:".
These lines can be repeated within the same file. The "User-Agent:"
line indicates which crawler type the subsequent "Disallow:" lines
apply to. You can specify a particular crawler by indicating its
User-Agent value (found in your Web logs), or simply specify "*" to
indicate all crawlers.
Following the "User-Agent:" line are one or more "Disallow:" lines,
typically indicating directories. Files can also be specified if
desired. Here's a sample robots.txt file:
User-Agent: *
Disallow: /
These two lines, if placed in the robots.txt file at the root of your Web site, tell all crawlers to ignore your entire site.
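If you want to see how a crawler would read such a file, here's a minimal sketch using Python's standard urllib.robotparser module. It assumes the two lines are a wildcard User-Agent followed by a Disallow of the root path. The host www.example.com and the crawler names ExampleBot and OtherBot are placeholders, not real values from anyone's Web logs.

```python
from urllib.robotparser import RobotFileParser

# Assumed two-line "keep everyone out" file: wildcard agent, root path.
DENY_ALL = """\
User-Agent: *
Disallow: /
"""

# Same idea, but aimed at one (hypothetical) crawler only.
DENY_ONE = """\
User-Agent: ExampleBot
Disallow: /
"""


def build_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text in memory; no network access is needed."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


deny_all = build_rules(DENY_ALL)
deny_one = build_rules(DENY_ONE)

home = "http://www.example.com/"  # placeholder host

print(deny_all.can_fetch("ExampleBot", home))  # False
print(deny_all.can_fetch("OtherBot", home))    # False
print(deny_one.can_fetch("ExampleBot", home))  # False
print(deny_one.can_fetch("OtherBot", home))    # True
```

The second file shows the effect of naming a crawler on the "User-Agent:" line: only that crawler is told to stay out, and any other crawler sees no rule that applies to it.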
It's important to understand that a robots.txt file isn't a security
mechanism; it does nothing to prevent crawlers or individuals from
searching your site for files to index or view. Only polite crawlers
will request the file and honor its contents.
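To make the point concrete, here's a rough sketch of the difference between a polite crawler and one that never looks at robots.txt. It uses Python's standard urllib.robotparser module. The /internal/ path, the host www.example.com, the crawler name ExampleBot and both function names are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep every crawler out of /internal/.
ROBOTS_TXT = """\
User-Agent: *
Disallow: /internal/
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

# Placeholder URLs a crawler has discovered.
found_urls = [
    "http://www.example.com/index.html",
    "http://www.example.com/internal/plan.html",
]


def polite_plan(agent, urls):
    """A well-behaved crawler drops anything robots.txt disallows."""
    return [url for url in urls if rules.can_fetch(agent, url)]


def impolite_plan(agent, urls):
    """A crawler that ignores robots.txt simply requests everything."""
    return list(urls)


print(polite_plan("ExampleBot", found_urls))
# ['http://www.example.com/index.html']
print(len(impolite_plan("ExampleBot", found_urls)))
# 2
```

Notice that the filtering happens entirely inside the crawler. Nothing on the server side enforces it, which is exactly why the file can't be treated as access control.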
If you want some of your site to be found in search engines, but have other files you want to keep out, you should disallow all directories except the ones you want to make available in the search engine. For example, if you have the following structure on your Web root:
"/": Publicly available information to be put into search engines
"/Dev": Stuff you're working on but don't want published
"/Private": Stuff you definitely don't want published
Your robots.txt file would look like this:
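As a minimal sketch, assuming the three-directory layout above and a wildcard User-Agent, the rules could be written and checked with Python's standard urllib.robotparser module. The exact rule text is an assumption, as are the host www.example.com, the crawler name ExampleBot and the file name notes.txt.

```python
from urllib.robotparser import RobotFileParser

# Assumed rules for the layout above: "/" stays public,
# /Dev and /Private are disallowed for every crawler.
SITE_RULES = """\
User-Agent: *
Disallow: /Dev
Disallow: /Private
"""

rules = RobotFileParser()
rules.parse(SITE_RULES.splitlines())

BASE = "http://www.example.com"  # placeholder host


def allowed(path: str) -> bool:
    """Would a compliant crawler (hypothetical ExampleBot) fetch this path?"""
    return rules.can_fetch("ExampleBot", BASE + path)


print(allowed("/"))                   # True
print(allowed("/Dev/FOO.ASP"))        # False
print(allowed("/Private/notes.txt"))  # False
```

Each "Disallow:" value is matched as a prefix of the requested path, so everything beneath /Dev and /Private is covered by those two lines.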
To be extra secure, you should put some form of authentication on both the /Dev and /Private sub-directories.
Finally, you might have specified that nothing should be crawled, yet you find crawlers still reading directory pages that should be
off-limits. This means there's still a link to a page on your site
somewhere on the Internet.
Using the previous example, let's say you've got a file named FOO.ASP in the /Dev directory. According to your robots.txt file, it shouldn't be crawled. However, there's no defense if some other site offers up a link that points directly at FOO.ASP.
Crawlers will follow that link to your FOO.ASP page and include it in their searches. There's nothing you can do about this. That's why
authentication is a necessary extra step to prevent access.
Russ Cooper is a Senior Information Security Analyst with
Cybertrust, Inc., www.cybertrust.com. He's also founder and editor of
NTBugtraq, www.ntbugtraq.com, one of the industry's most influential
mailing lists dedicated to Microsoft security. One of the world's most-
recognized security experts, he's often quoted by major media outlets
on security issues.
Russ Cooper's Security Watch column appears every Monday in the
Redmond magazine/ENT Security Watch e-mail newsletter.
Russ Cooper is a senior information security analyst with Verizon Business, Inc.