How to identify bad robots

Discussion in 'Search Engine Optimization' started by mit, Sep 10, 2008.

    Most of the methods below rely on having access to the web server's access logs; you need to check them regularly for unauthorized accesses.
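
    A quick way to keep an eye on the logs is a small script. The sketch below is only an illustration: it assumes an Apache/nginx "combined" log format and a log path of /var/log/apache2/access.log (both assumptions, adjust for your server), parses each line, and prints the busiest clients with their user agents. The later sketches reuse the same assumptions.

        # Minimal sketch: summarize an access log in "combined" format.
        import re
        from collections import Counter

        LOG = "/var/log/apache2/access.log"   # assumed path
        # ip  ident  user  [time]  "request"  status  size  "referrer"  "user-agent"
        LINE = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) \S+ "([^"]*)" "([^"]*)"')

        hits, agents = Counter(), {}
        with open(LOG) as f:
            for line in f:
                m = LINE.match(line)
                if not m:
                    continue
                ip, when, request, status, referrer, agent = m.groups()
                hits[ip] += 1
                agents[ip] = agent

        for ip, n in hits.most_common(20):
            print(n, ip, agents[ip])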

    * Robots that use entries in robots.txt to get at hidden files
    I have an entry in my robots.txt file that points to a directory that is not mentioned anywhere on the web site. Anyone who accesses it must have read the robots.txt file, and such a client will almost always be banned. A sketch for finding them follows.
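
    A minimal sketch, assuming the trap directory is called /hidden-dir/ (a placeholder name of mine) and appears only in robots.txt as "Disallow: /hidden-dir/". It lists every client that requested the trap and prints Apache 2.2-style deny lines you could paste into an .htaccess file.

        # Minimal sketch: clients that requested the hidden trap directory.
        import re

        LOG = "/var/log/apache2/access.log"   # assumed path
        TRAP = "/hidden-dir/"                 # placeholder trap directory
        LINE = re.compile(r'(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) ([^" ]+)')

        offenders = set()
        with open(LOG) as f:
            for line in f:
                m = LINE.match(line)
                if m and m.group(2).startswith(TRAP):
                    offenders.add(m.group(1))

        for ip in sorted(offenders):
            print("Deny from", ip)            # Apache 2.2-style deny line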

    * Robots that ignore robots.txt
    There is a special directory on this server, namely /botsv/, which is listed in robots.txt. Any access must come either from someone surfing the net or from a robot, so any robot that accesses it shows that it ignores robots.txt (see the sketch below).
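
    One way to separate the robots from the surfers is to cross-reference the log: a client that fetched /robots.txt and still requested /botsv/ has demonstrably read and ignored it. The sketch below builds both sets and prints the intersection (same assumed log path and format as above); clients with an obviously robotic user agent that hit /botsv/ without ever fetching robots.txt are suspect too.

        # Minimal sketch: clients that fetched /robots.txt and still entered /botsv/.
        import re

        LOG = "/var/log/apache2/access.log"   # assumed path
        LINE = re.compile(r'(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) ([^" ]+)')

        read_robots, hit_botsv = set(), set()
        with open(LOG) as f:
            for line in f:
                m = LINE.match(line)
                if not m:
                    continue
                ip, path = m.groups()
                if path == "/robots.txt":
                    read_robots.add(ip)
                elif path.startswith("/botsv/"):
                    hit_botsv.add(ip)

        for ip in sorted(read_robots & hit_botsv):
            print("ignores robots.txt:", ip)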

    * Robots that follow links through cgi scripts
    Visible in the log files. CGI scripts are usually not meant to be indexed, because they generate dynamic pages that change very frequently, and every request costs CPU time on the server. A rough way to spot crawlers here is to count requests for script URLs per client, as sketched below.
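
    A minimal counting sketch; the /cgi-bin/ prefix and the "any URL with a query string" rule are assumptions about where your dynamic pages live.

        # Minimal sketch: count requests for script/dynamic URLs per client.
        import re
        from collections import Counter

        LOG = "/var/log/apache2/access.log"   # assumed path
        LINE = re.compile(r'(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) ([^" ]+)')

        cgi_hits = Counter()
        with open(LOG) as f:
            for line in f:
                m = LINE.match(line)
                if m and (m.group(2).startswith("/cgi-bin/") or "?" in m.group(2)):
                    cgi_hits[m.group(1)] += 1

        for ip, n in cgi_hits.most_common(10):
            print(n, ip)

    A handful of such requests is a normal visitor; hundreds from one client usually means a crawler walking every generated link.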

    * Robots that traverse the whole web site in seconds
    Visible in the log files as a burst of requests from one client covering the whole site within a few seconds; the rate check sketched after the next item catches this.

    * Robots that revisit the web site too often
    Visible in the log files as the same client re-fetching pages far more often than they change; the same rate check (sketched just below) covers this case with a longer window.
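
    Both of the last two cases come down to request rate, so one sketch covers them: parse the timestamps and report any client that makes more than some number of requests within a short window. The window and limit below are arbitrary examples; lengthen the window to hours or days to catch over-frequent revisits instead of fast traversal.

        # Minimal sketch: flag clients that exceed a request-rate threshold.
        import re
        from collections import defaultdict
        from datetime import datetime

        LOG = "/var/log/apache2/access.log"   # assumed path
        LINE = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\]')
        WINDOW, LIMIT = 10, 30                # more than 30 requests in 10 seconds

        times = defaultdict(list)
        with open(LOG) as f:
            for line in f:
                m = LINE.match(line)
                if not m:
                    continue
                ip, stamp = m.groups()
                # combined-format timestamp, e.g. 10/Oct/2000:13:55:36 -0700
                t = datetime.strptime(stamp.split()[0], "%d/%b/%Y:%H:%M:%S")
                times[ip].append(t)

        for ip, ts in times.items():
            ts.sort()
            for i in range(len(ts) - LIMIT):
                if (ts[i + LIMIT] - ts[i]).total_seconds() <= WINDOW:
                    print("too fast:", ip)
                    break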

    * Robots that are known to search for email addresses
    They are sometimes mentioned in user groups and mailing lists. However, by setting up a special page that includes an email address which changes whenever someone loads it, they are relatively easy to spot: whenever mail is sent to one of those 'trap' addresses, you just search the log files to find out who harvested that address. You need your own domain to do this, though. A sketch of such a trap page follows.
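
    A minimal CGI sketch of such a trap page: every page load shows a fresh address under a domain you control (example.com is a stand-in here) and records which client saw it. The log file path is a placeholder too; the script needs write permission wherever you point it.

        #!/usr/bin/env python3
        # Minimal CGI sketch of a spam-trap page: each request gets a unique
        # address, and the token, client address and user agent are logged.
        import os, secrets, time

        token = secrets.token_hex(4)
        addr = "trap-%s@example.com" % token                # example.com is a stand-in

        with open("/var/log/email-traps.log", "a") as f:    # placeholder path
            f.write("%s %s %s %s\n" % (
                time.strftime("%Y-%m-%d %H:%M:%S"),
                token,
                os.environ.get("REMOTE_ADDR", "-"),
                os.environ.get("HTTP_USER_AGENT", "-"),
            ))

        print("Content-Type: text/html")
        print()
        print("<html><body>Contact: <a href='mailto:%s'>%s</a></body></html>" % (addr, addr))

    When spam later arrives at trap-xyz@example.com, grepping the trap log for xyz gives the harvester's IP and user agent, which you can cross-check against the access log.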

    Source - fleiner.com/bots/
