Restricting Access to Web Pages on Apache Web Servers

Why bother?

There are a number of reasons to be concerned about access to web pages and other information on the system. One reason is security. If you serve certain kinds of information, you are making your system, and possibly other systems, vulnerable to attack.

Another reason is the potential for embarrassment. Communications made public may come as a surprise to some of the people involved. One example is email sent to a list that is archived in a web-accessible way without the sender realizing the archive is public. Another example is a reference to a vendor that should not be made publicly available. You may think it unlikely that your web page will be found, but many people and companies regularly search on their names and the names of their products; if the search engines index your pages, they may be found by an audience you didn't intend.

Ways to Restrict Access

Access to web pages can be controlled in a number of ways:

  • Allow access by Single Sign-On (SSO). We recommend you use SSO to secure your website. SSO installation and usage documentation can be found at http://cd-docdb.fnal.gov/cgi-bin/ShowDocument?docid=5685
  • Limit what the search engines can index
  • Control access by IP address
  • Allow access by password only

Note that none of these methods except SSO is very secure with the standard servers. For example, passwords passed without encryption are easy to intercept. If you have sensitive information to which access must be controlled, seek expert advice.

Limit what search engines index

Because search engines are one of the main ways people find pages that were never meant to be found, controlling what the search engines index is a useful first step. Of course, these methods don't keep anyone from seeing your pages if they already know the URL.

Also note that restricting access to a file, or unlinking it, will not remove it from a search engine's index. Even re-indexing your site may not get the file removed; entries can remain for many months, and even dead links persist for a long time (as anyone who uses search engines a lot knows). Once a file has been listed, the only way to be sure it is no longer available from a search engine is to move it (change the URL) or remove it.

Web Robots are programs that explore the web automatically. They are also sometimes called spiders or crawlers. The search engine indexers are web robots.

One way to exclude log files, dynamic pages, and anything else you don't want indexed is to list the directories or files in a robots.txt file in your root area. Note that honoring a robots.txt file is voluntary for search engines and is not enforced; many search engines, such as Google, obey robots.txt commands, but others may not. When a compliant indexing robot visits your site, it first looks for a file named robots.txt. For example, such a file on the main web server would have the URL http://www.fnal.gov/robots.txt. Only one robots.txt file per site, at the top level, is recognized.

You can exclude all robots (*) or specific robots, and you can exclude the whole site, specific directories, or individual files. Paths are matched as simple prefixes; wildcard and regular expressions are not part of the standard.

The following robots.txt file tells compliant search agents not to crawl or index the directories /logs/ and /private/ or any of their subdirectories:

#no robots
User-agent: *
Disallow: /logs/
Disallow: /private/
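
To keep one specific robot out of the entire site instead, name that robot on the User-agent line rather than using the wildcard. A minimal sketch (Googlebot is used only as an illustration; check a robot's documentation for its actual User-agent name):

#exclude one specific robot from the whole site
User-agent: Googlebot
Disallow: /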


Prevent search engines from searching and indexing a site

To prevent search engines from searching and indexing a site at all, create a /robots.txt file containing:

User-agent: *
Disallow: /


If you don't have access to the root area of your server, you can control indexing on a page-by-page basis using META tags. (Note that not all search engines honor this META tag.) META tags belong in the HEAD section of your document. In the following example, the page containing the META tag should not be indexed, nor should its content be analyzed for links:

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
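
For context, a minimal page with the tag in place might look like the following sketch (the title and body text are placeholders):

<HTML>
<HEAD>
<TITLE>Example page</TITLE>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
</HEAD>
<BODY>
...page content...
</BODY>
</HTML>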

Allow only certain IP addresses to access some pages

Restricting access to web pages by IP address allows you to grant or deny specific computers, groups of computers, or domains access to Web sites, directories, and in some cases, individual files. For example, you can allow access only from computers in the fnal.gov domain or only by specific computers based on their IP address.

Restricting access by IP address is web-server specific. For information about how to do this with popular servers, see below.

On Unix with an Apache server, IP address restrictions can be placed in the server configuration itself, or the server can be configured to allow access-control commands on a directory-by-directory basis (for example, in .htaccess files). If your server is an NT system running IIS, such configuration is done through the Internet Service Manager.
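
As a sketch of the second approach, the server configuration (httpd.conf) must permit overrides for the directory tree before .htaccess files placed there will be honored; the path below is only an example:

# httpd.conf: let .htaccess files under this tree set
# authentication and host/IP access rules
<Directory "/var/www/html/protected">
    AllowOverride AuthConfig Limit
</Directory>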

Note that it is sometimes preferable to use 131.225 rather than fnal.gov when restricting access to Fermilab staff and users. The reason is that there are occasionally non-fnal.gov domains that are used and administered by Fermilab staff and users; these have different domain names from fnal.gov, but still have IP addresses in the 131.225 range.
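
A minimal .htaccess sketch, assuming Apache 2.4 with mod_authz_host and a directory that permits overrides as above, could admit fnal.gov hosts as well as the 131.225 address range (access is granted if either Require line matches):

# .htaccess: allow only Fermilab hosts
Require host fnal.gov
Require ip 131.225

On older Apache 2.2 servers the equivalent uses the Order, Allow from, and Deny from directives; note also that Require host depends on reverse DNS lookups, which adds some overhead.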

To test that your IP-based protections are working, you need to request the pages from an IP address that is off-site, or at least outside the subnets/IPs you have allowed.
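
For example, from a machine outside the allowed range you could request a protected page with curl and confirm that the server refuses it (the URL below is only an illustration):

# run from an off-site machine; expect "403 Forbidden" rather than "200 OK"
curl -I http://www.fnal.gov/private/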

Allow access by password only

By allowing access by password only, you can restrict access to small groups of people or to individuals, depending on the level of security you need. Password access should be encrypted via SSL. This kind of restriction is also server specific:

On a Unix server running Apache, password access is also configured on a directory-by-directory basis: first set up a password file, then create a .htaccess file in the target directory that refers to that password file.
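
A minimal sketch, assuming Apache 2.4 with mod_auth_basic and mod_authn_file; the paths and user name are examples only, and the password file should live outside the web-served tree so it cannot itself be downloaded:

# create the password file and add a user (omit -c when adding further users)
htpasswd -c /var/www/passwords/.htpasswd someuser

# .htaccess in the directory to protect
AuthType Basic
AuthName "Restricted area"
AuthUserFile /var/www/passwords/.htpasswd
Require valid-user

Serve the protected directory over SSL (HTTPS) so that the password is not sent in the clear.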

Do not use your system password as your web-access password. Web access passwords are not as secure as system passwords!