A Tool for Finding Co-Citation Links to a List of Sites

March 20 2014, 4:02pm
0 Comments

I suffer from programmer's block, which I define as an affliction whereby I will write my own code/software to solve problems rather than use third-party software.

Anyway, one such program I wrote was a PHP program that takes link data exported from a tool such as Majestic or Open Site Explorer and analyzes it to find the really juicy links, including the "co-citation links", i.e. the pages/domains that are doing most of the co-linking to the top pages. These pages with a lot of co- and cross-citation links usually end up being pretty decent linking opportunities and good link neighborhoods.

The data the program spits out is pretty good (it can also add weights to known domains or TLDs like .gov or .edu).
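For the curious, here's a rough sketch of the core counting idea in PHP (not the actual program). It assumes a simple CSV export with a source-URL column followed by a target-URL column; the file name and column layout are placeholders:

// Count how many of your top target pages each external domain links to.
$counts = array();
$handle = fopen('link-export.csv', 'r');
while (($row = fgetcsv($handle)) !== false) {
    if (count($row) < 2) { continue; }      // skip blank/short rows (trim any header row yourself)
    list($source, $target) = $row;
    $domain = parse_url($source, PHP_URL_HOST);
    if (empty($domain)) { continue; }
    $counts[$domain][$target] = true;       // unique target pages per linking domain
    // A weighting step for known domains or TLDs like .gov/.edu could go here.
}
fclose($handle);

// Domains linking to the most distinct target pages are the co-citation candidates.
$scores = array_map('count', $counts);
arsort($scores);
print_r(array_slice($scores, 0, 20, true));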

Soooooo... Trying to figure out what to do with this software... I may provide the source code online for free, make a commercial solution where you can upload the raw data and get the final results out, or something else. Let me know what you think I should do. Contact me via email, Facebook, Twitter, etc. and let me know if you would like to have this tool available to you - or if you think it's derp.

Reciprocal Links - Are They Good Or Bad?

June 27 2011, 11:24am
3 Comments

Ah... reciprocal links. This is definitely one of the more polarized topics in the SEO world. Some SEOs will tell you that they are mega black-hat, punishable by instant death in search engines. Others will tell you that they are absolutely fine and encourage them. As with most things, reality lies somewhere between these two points of view.

What is a "Reciprocal Link"?

In its simplest form, a reciprocal link is a link that points to a Website which, in turn, links back. So, if Site A links to Site B and Site B links back to Site A, the links are called "reciprocal links".

In contrast, if Site A links to Site B and Site B does not link back, then the link is called a "one-way link".

So, What's Up With Reciprocal Links?

Here's the truth: Only do reciprocal links when the two sites are of high quality and when the link is of value to the Website visitors. If you follow this principle, you shouldn't have any problems.

For example, if you own a Website that sells bicycle equipment, it might make sense for you to link to a bicycle tire manufacturer's Website. Conversely, it might make sense for them to link back to you, if you sell their tires for example. Something like this is absolutely fine and passes the smell test.

However, if your bicycle Website links to a casino or prescription drug Website and they link back, something about this smells fishy and Google would probably look at that very closely. If you have multiple links like this, you are almost certainly in trouble.

But... Don't Forget About Link Farms

Site-to-site reciprocal links may be fine if the site quality is high and the sites are in some way related; however, one thing you definitely should try to avoid is getting involved in link farms or complex linking schemes. For example, if you own 10 sites and you try to link them all together in a daisy-chain type of way, or if you have all of them reciprocating links with each other (even if they are related), then Google may think that you are trying to manipulate rankings, and again you could find yourself in deep trouble.

Summary

Remember: Only link to high quality sites that are related to yours, especially if the link is going to be reciprocated. And don't get involved with daisy-chain linking or cross-linking sites.

A Quick Refresher on SEO and 301/302 HTTP Redirects

November 5 2010, 12:45pm
2 Comments

By Darrin J. Ward

I'll preface this refresher on 301 and 302 HTTP redirects by saying that we always strive to plan the layout of Websites so that we will never need to move or rename pages. You should try to do the same!

However, sometimes it's unavoidable and pages need to be moved or removed entirely. When that happens, it's very important that the right strategy be used to "redirect" the page from the old URL to the new URL. There are two viable options for doing page redirects: a 301 redirect or a 302 redirect. The 301/302 number refers to the "status code", which is sent by the Web server to robots/crawlers/browsers, informing them of what action is being taken.

A 301 redirect code indicates that the move is PERMANENT. This type of redirect should be used when you know that the page WILL NOT move back to its original location.

A 302 redirect code indicates that the move is TEMPORARY. This type of redirect should be used when you know that the page WILL eventually move back to its original location, or somewhere else.

In the vast majority of cases, it's the 301 redirect that you will want to use. The 301 redirect passes most of the PageRank and "link juice" from the old page address to the new page address, which is exactly what you want to help maintain rankings and PageRank. I will point out that there is going to be some loss due to the 301 redirect (our internal research suggests anywhere from 10-25% is lost), which is why it's best to avoid redirects altogether, where possible.

302 redirects do not maintain the PageRank and link juice in the same way.

Implementing 301 and 302 Redirects

The implementation of these redirects is usually done at either the server level or in the programming code. Here are some samples.

301 Redirect in PHP

I wrote the following simple function to allow me to perform 301 redirects:

function Redirect301($GoTo) {
    header("HTTP/1.1 301 Moved Permanently");
    header("Location: $GoTo");
    exit; // stop the script so nothing else is output after the redirect headers
}

This function can be called in your php code like this:

Redirect301("http://www.example.com/new-url/");

302 Redirect in PHP

PHP does not actually have a built-in redirect() function; the built-in header() function sends a 302 redirect by default when you set a Location header. You can perform a 302 redirect like this:

redirect("http://www.example.com/new-url/");

301 Redirect in Apache using .htaccess or httpd.conf

Using the HTTP server itself is a popular way to perform redirects. Because I use Apache most of the time, I'll limit the discussion to the .htaccess and httpd.conf methods for Apache, and leave IIS and other servers alone for now. The easiest way to perform a 301 redirect is to use the Redirect directive, which can be used like this in .htaccess or httpd.conf:

Redirect Permanent /old-page.htm /new-page.htm

302 Redirect in Apache using .htaccess or httpd.conf

302 redirects also use the Redirect directive, but without the Permanent flag, e.g.:

Redirect /old-page.htm /new-page.htm

Testing Redirect Response Codes

You should always check that the server is sending the correct redirect response code. Don't just assume that your redirects are working correctly. Here are a couple of tools that I recommend for testing HTTP redirects (and server headers in general):
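Whichever tool you end up using, you can also do a quick sanity check straight from PHP with the built-in get_headers() function. A minimal sketch (the URL below is just a placeholder):

$headers = get_headers('http://www.example.com/old-page.htm');
echo $headers[0] . "\n";                       // e.g. "HTTP/1.1 301 Moved Permanently"
foreach ($headers as $header) {
    if (stripos($header, 'Location:') === 0) { // where the redirect points
        echo $header . "\n";
        break;
    }
}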

Google Penalty for No rel=nofollow on Affiliate Links

September 24 2009, 10:11am
6 Comments

Barry Schwartz over at the Search Engine Roundtable reminds us today that you should use rel=nofollow on your affiliate links, or else you may receive a Google penalty.

The inherent illogic of stuff like this makes my blood boil sometimes... Why would/does Google penalize content just because links to affiliates or other sites do not use rel=nofollow? Either the content on the page is useful and deserves to rank, or it doesn't. I don't see why having links lacking rel=nofollow alone should be a determining factor in that decision. Using rel=nofollow is a technicality.

If Google determines that the links on a page are against their paid-linking policy, then they should just discount any "link juice" that might get passed on from them. That's something they could do transparently in the background, without having to force Webmasters to consider this ridiculous rel=nofollow tag, and without having to deprive searchers of valuable content (assuming Google otherwise determined it to be valuable except for the non-rel=nofollow affiliate links).

Alas, the Google insidiousness continues, and we continue to begrudgingly comply forthwith so that we may get some rankings love! Although the whole thing does remind me of the Pied Piper sometimes :)

How Important is an ODP/DMOZ Link for SEO?

July 6 2009, 2:33pm
5 Comments

If you've been in the Internet Marketing industry for any length of time, then you will have heard of the "ODP" or "DMOZ", the Open Directory Project that resides at www.dmoz.org. The ODP is a large general Web directory edited by volunteers. And for years it was considered almost the holy grail for inbound link developers. Some still consider it to be so.

A member at WebmasterWorld asks "Is DMOZ still relevant in 2009?". The responses are interesting.

As part of our SEO campaigns, we do perform directory submissions to a select number of top-tier general directories and a small number of niche directories (the number depends on the niche). The ODP is still in the top 3 of our most desired general directory link acquisition targets. But it's certainly not a holy grail of any sort.

The ODP certainly has its problems. It's very slow to get anything listed in the ODP due to the lack of editors/volunteers compared to the volume of submissions they receive. Internet users seem to be trending away from directory-type Websites and converging on social/search-type sites. And the ODP hasn't done anything even remotely innovative in years (in fact, I don't know if they've ever done anything innovative).

But the ODP still gets used in countless places across the Web. So a listing/link in the ODP inherently means links from many other places. The ODP link itself probably carries more weight than all of those subsequent links combined, but they're still a positive.

Yep - for me submitting to the ODP is still relevant in 2009. Not as much as it used to be, certainly. But it's still relevant. I do however recommend that you read my insights on submitting to directories for SEO.

Updated SEOmoz SEO Best Practices / Policies

June 23 2009, 2:23pm
0 Comments

SEOmoz has published some updated SEO best practices guidelines. The guidelines are apparently based on "correlation data", which means that they looked at rankings and analyzed the different components on the ranking pages.

The list of SEO best practice items gives recommendations for:

  • Title Tag Format
  • The Usefulness of H1 Tags
  • The Usefulness of Nofollow
  • The Usefulness of the Canonical Tag
  • The Use of Alt text with Images
  • The Use of the Meta Keywords tag
  • The Use of Parameter Driven URLs
  • The Usefulness of Footer Links
  • The Use of Javascript and Flash on Websites
  • The Use of 301 Redirects
  • Blocking pages from Search Engines
  • Google SearchWiki's Effect on Rankings
  • The Effect of Negative Links from "Bad Link Neighborhoods"
  • The Importance of Traffic on Rankings

This is great stuff, but as with everything in the "SEO" world, it needs to be taken with a pinch of salt. Each element that gets analyzed essentially introduces another unknown variable into a simultaneous equation.

One of the most interesting items is that H1 tags have been reduced to having nearly no importance in search engines. What I'm wondering is whether or not SEOmoz also looked at the CSS styling of the H1s, to determine if H1s styled to a smaller font carry less weight, or if the reduced importance of the H1 applies across the board. We know that Google looks at CSS and JavaScript.

Freshness Optimization - Optimizing for Google Fresh Rankings

June 18 2009, 6:57pm
2 Comments

Bob Heyman today on Search Engine Land notes that the Google freshness factor may have big implications for retailers. He notes that the EVP of ice.com, a large Internet retailer, is making proactive changes to their site because of the recent search "options" functionality introduced by Google that allows searchers to select "recency" as a criterion.

They are indeed correct. When you search on Google you will see a "Show Options" link at the top of the SERPs. When you click this link, you will see the "recency modifiers" options of "Any time", "Recent results", "Past 24 hours", "Past week" and "Past year". These allow searchers to refine the search results based on how recently the pages were updated.

If you sell products online, you probably don't update your product pages very often. That could have a negative impact on your traffic levels if many people adopt Google's recency modifiers, because pages that haven't been updated in a long time won't get listed in SERPs that require recently modified pages.

So, what can you do about it?

The first few things that come to mind are: "daily changes", "Last-Modified", "checksum" and "page size". If you can keep all of these in mind and know how they relate to each other, then you should be able to engineer yourself into always having fresh content.

Google are looking for pages that are recently modified, so the best way to fit that criterion is to actually add new content to pages daily. Keep in mind, though, that they probably look for pages that exceed some threshold of new content before the page is actually considered changed or updated. So just adding or changing 1 sentence on a page with 100 sentences probably isn't going to cut it. I don't know what the threshold is, but I would be comfortable recommending a guideline minimum of 10-20%. This means 1 or 2 new stories every day for a page that normally features 10 stories.

I know what you're thinking... I'll add some random content and every time a search engine sees the page it will be different. I generally advise against this because if Google find that your content is completely random, then they will be a lot less confident sending traffic to you for a specific keyword, given that the relevant content that was on the page at the time they spidered it will likely be gone when a user goes to see the page. Frequent change = good. Random = bad.

So. Commit to making a few changes throughout the day and you should always be there for a "Past 24 Hours" search.

"Last-Modified" is an HTTP header which a web server sends with the response to a request. The Last-Modified header tells the client (the search engine spider in this case) when the page was last modified. It's very likely that Google and other search engines wanting to determine freshness will look for this header. However they won't completely rely on it because it can be "faked" to whatever date the Webmaster wants. So, search engines will still look for content changes. Always sending the current time is bad.

It's important to note that the Last-Modified header is not always sent by default. It is sent most of the time with static content/pages, but sites that are dynamic generally don't send this header by default due to the complexities in calculating the true last time of modification. If you're selecting a CMS, this may be a worthy consideration. Incidentally, there is also something called the "If-Modified-Since" header, which you should look into.
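To illustrate, here is a minimal sketch of how a dynamic PHP page might send Last-Modified and honour If-Modified-Since. The modification date is a placeholder; in practice it would come from your CMS or database:

$lastModified = strtotime('2009-06-18 12:00:00');          // placeholder value
header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $lastModified) . ' GMT');

if (isset($_SERVER['HTTP_IF_MODIFIED_SINCE']) &&
    strtotime($_SERVER['HTTP_IF_MODIFIED_SINCE']) >= $lastModified) {
    header('HTTP/1.1 304 Not Modified');                   // client's copy is still current
    exit;
}
// ...otherwise render the full page as normal.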

Finally, a quick and dirty way to check for changes to a page would be to compare the checksum values and the file sizes to previous versions of the document. I won't go into much detail here because I'm not sure that Google are using these methods, but if 2 versions of the same file pulled at different times have exactly the same size, then there is at least a small probability that they are identical.

The checksum method is more accurate, but still not perfect. A checksum comparison will compare the checksum of 2 versions of the same document, and if the checksums are identical then there is a good chance that the documents themselves are identical. This method gives a pretty accurate yes or no answer as to whether the 2 documents are identical. It does not measure the degree to which the documents' contents differ (the percentage of content that is different).
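As a rough sketch of that size/checksum comparison (the file names are placeholders for two saved copies of the same page):

$old = 'page-2009-06-17.html';
$new = 'page-2009-06-18.html';

if (filesize($old) === filesize($new) && md5_file($old) === md5_file($new)) {
    echo "The two versions appear to be identical.\n";
} else {
    echo "The page has changed (though this says nothing about how much).\n";
}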

I hope this helps to at least get you thinking about this important issue. I know that I'm using the recency modifiers quite a bit, but I don't know what the adoption numbers are; hopefully Google tells us at some point. Submit a comment or get in touch if you have something to say!

Optimizing Inbound Link Anchor Text Through Diversity for SEO

June 15 2009, 4:17pm
1 Comments

A poster at WebmasterWorld.com asks about using different anchor/link text to point to the same page for SEO.

This is a good question and I thought I'd share some quick insight as to how I normally approach link development in terms of anchor-text diversification.

The short answer is that diversity in inbound links is a good thing, because it shows that: a) the links are less likely to be auto-generated or copy/pasted everywhere, and b) the page is relevant within a variety of slightly different contexts (presuming the same general topic).

The longer answer is that although some diversity in anchor text is a good thing, you need to be careful not to overdo it in case you dilute ranking potential for the real keywords. If a page has 100 inbound links but none of them are the same or they don't even contain the same keyword, then how will a search engine know which keywords are most relevant (besides looking at on-page content!)? The best thing to do is to make sure that at least 50-60% of inbound links contain the root keyword - or one of its closely stemmed variants or synonyms - in combination with other words. The rest of the links can be whatever, but ideally there would be some consistent phrase usage too.
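If you want to gauge where an existing link profile stands, a quick back-of-the-envelope check is easy to script. A rough PHP sketch (the anchor texts and keyword variants below are made up purely for illustration):

$anchors  = array('blue widgets', 'cheap blue widgets', 'click here', 'widget shop', 'blue widget reviews');
$variants = array('blue widget', 'blue widgets');       // root keyword and close variants

$matches = 0;
foreach ($anchors as $anchor) {
    foreach ($variants as $variant) {
        if (stripos($anchor, $variant) !== false) {     // case-insensitive substring match
            $matches++;
            break;
        }
    }
}
printf("%.0f%% of anchors contain the root keyword.\n", 100 * $matches / count($anchors));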

The other thing to consider is the positioning of the link itself, and whether or not it's an internal link or a link from an external site. If the link is from a global navigation menu, then it's not practical (or good) to make that link different on every page just for SEO (plus, UX people would scream at you). Also, if the links are internal, then I think the tolerance to lower diversity in link text is higher (the same people will probably use the same descriptions multiple times). Considering links from multiple external sites, it makes sense to think that there should be higher diversity in link text because there are probably multiple writers involved, and no 2 writers will do exactly the same thing.

Webmaster Jam Session 2007 Slides

September 24 2007, 9:51pm
0 Comments

by Darrin J. Ward:

A big thank you to everyone that came to the Webmaster Jam Session this year in Dallas, TX. Although I got a lot less rest than I would have liked, I can truly say that I had a fantastic time.

I had one or two requests for the slides from this year's Search Engine Strategies, so without any further delay, click the link below for the PDF file. Please contact me if you have any questions.

Search Engine Strategies Slides (September 22nd 2007)

Vital Academic Papers/Articles for SEO (Search Engine Optimization)

August 4 2007, 2:21am
0 Comments

by Darrin J. Ward:

I've always been greatly interested in mathematics. Well, not always, but I did come to have a lot of respect for applied mathematics and physics during my latter years of school and college. Now, I have to also admit that I don't understand as much as I'd like, because it would simply take far too much time to learn it all. The deep stuff is beyond me and I admit that. Nonetheless, I remain fascinated by the sheer logic in math and the fact that it transcends race, time, other languages, etc. It's a universal language.

Ever since I learned about Fermat's Last Theorem, I've been absolutely engrossed by the notion that a simple-looking and simple-sounding statement could boggle the minds of the world's greatest mathematicians for over 350 years. The theorem states, simply, that x^n + y^n = z^n has no solutions where x, y and z are integers greater than zero and n is an integer of value 3 or greater. You'll note that n=2 gives the Pythagorean theorem!

So, where is all of this going and how does it relate to SEO? Well, in reading the amazingly complicated Proof of Fermat's Last Theorem [PDF] by Andrew Wiles (and yes, I've actually had a printed copy in my office for the last few years), I've been forced to learn a little bit about some intriguing things in number theory. One such thing was Eigenvectors. In doing further research on these I came across a wonderful paper entitled "The $25,000,000,000 Eigenvector - The Linear Algebra Behind Google" by Kurt Bryan & Tanya Leise, which is basically about Google's PageRank (an Eigenvector).

I've read quite a lot of academic papers that theorize on various things, but I had not come across this particular one before, so it was a pleasure to look through it. I mostly use academic papers as a source of inspiration rather than as a solid foundation for an SEO campaign. They are wonderful for provoking me to think about abstract things which eventually help me get ahead in the SEO world.

The fact of the matter is that search engines are nothing more than big calculators (though with an arguable component of manual review, à la Google's Patent #7096214). If you know how they work and understand the steps that they make in performing their calculations, then you have a significant competitive advantage. Looking at what's being proposed in these academic papers therefore makes a lot of sense, as they are a great source of the latest thinking in terms of strategies.
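To make the eigenvector connection a little more concrete, here is a toy PHP sketch of the power-iteration idea behind PageRank, using a made-up four-page link graph. This is the textbook simplified model (no dangling-node handling), not Google's actual implementation:

$links = array(                 // page => pages it links to (hypothetical graph)
    'A' => array('B', 'C'),
    'B' => array('C'),
    'C' => array('A'),
    'D' => array('C'),
);
$damping = 0.85;
$pages   = array_keys($links);
$n       = count($pages);
$rank    = array_fill_keys($pages, 1 / $n);   // start from a uniform vector

for ($i = 0; $i < 50; $i++) {                 // repeated multiplication converges
    $next = array_fill_keys($pages, (1 - $damping) / $n);
    foreach ($links as $page => $outlinks) {
        $share = $rank[$page] / count($outlinks);
        foreach ($outlinks as $target) {
            $next[$target] += $damping * $share;
        }
    }
    $rank = $next;
}
print_r($rank);                               // the dominant eigenvector: C and A score highest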

So, here are some of the papers that I usually recommend to people wanting to learn more. They do have a lot of mathematics in some cases, but you can usually get some good info even without understanding everything (I will update this list every so often; contact me with additional suggestions):

Authoritative sources in a Hyperlinked Environment
-- by Jon. M. Kleinberg

Site Level Noise Removal for Search Engines
-- by Andre Luiz da Costa Carvalho, Paul-Alexandru Chirita, Edleno Silva de Moura, Pavel Calado, Wolfgang Nejdl (2006)

The Anatomy of a Large-Scale Hypertextual Web Search Engine
-- by Sergey Brin and Lawrence Page

A Survey of Eigenvector Methods For Web Information Retrieval
-- by Amy N. Langville & Carl D. Meyer

ParaSite: Mining Structural Information on the Web
-- by Ellen Spertus

The $25,000,000,000 Eigenvector - The Linear Algebra Behind Google
-- by Kurt Bryan & Tanya Leise

Preventing Bad Bots / Scraping / Email Harvesting

July 19 2007, 10:12am
0 Comments

Every serious webmaster that I know, and especially SEOs, have complained at least once about their content being stolen (scraping). Computer software known as "spiders", "robots" or "bots" regularly crawls the internet and visits our websites. Search engines use this spidering software to visit your website and include your pages in their indices. Unfortunately, people that are out to steal your content also use this software to download websites in huge volumes and then republish them elsewhere, thus detracting from the uniqueness of the "scraped" content. Or, some robots just relentlessly rip through pages in an attempt to harvest as many email addresses as possible - so that they can spam them later. The way to stop either these scrapers or email harvesters is the same. Update: SpyderTrax is a tool that automatically bans bad robots and tracks good ones.

Obviously, we don't mind the search engines accessing our website because they are "good" bots, but how do we prevent the content-robbing "bad bots" from accessing our websites?

Many people will tell you about something called mod_rewrite in the ".htaccess" file, which is an Apache directives file. Many of the directives/code which you will find on the internet use a very simple filtering system to prevent known-bad robots from accessing your content. The problem is that almost all of them rely on the spider-supplied "User-Agent" field. Some others rely on blocking known-bad IP address blocks. But none of them allow you as a webmaster to dynamically detect and block these bad robots.

First, let me give you a brief summary of all of the information that we will have available to us in order to make a sound judgement.

  • IP Address: The IP Address is a numerical identification of the computer making the request.
  • Requested URL: This is simply the page being requested by the robot.
  • User Agent: This is how the spider identifies itself. This value can be easily faked, so it's very untrustworthy.
Through analysis of these 3 parameters, it is possible to design an accurate system which will prevent unauthorized robots from accessing (scraping) our content.

I have been using a proprietary script for some time that attempts to solve the bad robots problem. Today, I'll share with you the logic that I have implemented, which has demonstrated great success. I know of a few other people that also use the following method with some variation. In fact you may be able to find a script or service that does this for you.

A brief overview of the steps that I take to detect scraping "bad bots" (also see the additional considerations below):

  1. See if client/spider follows a blank URL.
  2. If it does, get the IP address and perform a reverse DNS lookup.
  3. If the DNS lookup resolves to an untrusted domain, block access in .htaccess (or httpd.conf).

So, here is some more explanation about the procedure:

I have coded a hidden link into all of my web pages. Why? Because robots/spiders follow almost all links, and I want spiders to follow this special hidden link, not ordinary users - so I made it invisible. By invisible, I mean that it has no anchor text, for example something like <a href="/robots.php" rel="nofollow"></a>. Note that this code links to a php file and that I have used the rel=nofollow attribute (to prevent it from being followed by spiders that understand that attribute).

It is logical to assume that the majority of hits to this php file will be suspect, though we cannot yet accurately form an opinion as to whether the hit is from a good robot, a bad "bot" or some other source (such as a browser pre-fetch). However, we can use the php file to try and help us out, since only automated requests should be made to this file.

So, again, what information do we know from the hit to the php file? Remember, for each hit we have an IP address, a requested URL and the User-Agent. We can't really use the User-Agent field, because it is supplied by the requesting agent, and thus potentially mendacious. We can trust the requested URL because it is logged on our side (but we've already exhausted this to our advantage by detecting the hit to robots.php). We can also trust the IP address.

Yes, IPs can be faked/spoofed, but the majority of scrapers are small-time and don't have the resources or know-how to perform IP spoofing (which is actually very difficult to do over TCP/IP to websites).

OK, so let's look at the IP. If we suspect the request is suspicious (which it is, since it followed the invisible link to the php file), then we need to determine whether or not the IP can be trusted, i.e. if it's a good or bad bot, and that can be done via a reverse DNS lookup. A reverse DNS lookup tries to translate an IP address into the domain name that it belongs to. Here is an example of some IP addresses (taken directly from my log file) and the domain names to which they correspond:

  • 66.249.72.12: crawl-66-249-72-12.googlebot.com
  • 207.234.130.25: 207-234-130-25.ptr.primarydns.com (I found this masquerading with User-Agent "Googlebot")

So, we found that the first IP address does actually belong to Google, but the second one doesn't. We immediately ban the second IP address by adding it as a "deny from" entry in the .htaccess file, because it is crawling our site and downloading our content, as evidenced by its request to the php file in the hidden link.

We don't know what that domain is and we don't want to trust it with unrestricted access to our content. When I add an IP address to my list, I regenerate my .htaccess ban file immediately, which ensures the bad bot is blocked right away.

At this point, you should realize that you will need to compile a list of trusted domain names. You should also thoroughly test your script so that you know it won't erroneously block a trusted domain.
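To make the logic concrete, here is a rough PHP sketch of what such a trap script (robots.php) might look like. The trusted-domain suffixes and the .htaccess path are examples, and a production version would also log the hit to a database and handle the unban logic discussed below:

// robots.php - only automated clients should ever request this file.
$trusted = array('.googlebot.com', '.search.msn.com', '.crawl.yahoo.net');   // example suffixes

$ip   = $_SERVER['REMOTE_ADDR'];
$host = gethostbyaddr($ip);                      // reverse DNS lookup (returns the IP on failure)

$isTrusted = false;
foreach ($trusted as $suffix) {
    if ($host !== $ip && substr($host, -strlen($suffix)) === $suffix) {
        // PTR records can be set to almost anything, so confirm with a forward lookup.
        if (gethostbyname($host) === $ip) {
            $isTrusted = true;
        }
        break;
    }
}

if (!$isTrusted) {
    // Ban the IP by appending a deny rule to the .htaccess file (path is an example).
    file_put_contents('/path/to/site/.htaccess', "deny from $ip\n", FILE_APPEND | LOCK_EX);
}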

I'm going to leave the logistics of the unban up to you. I present my banned users with a captcha challenge to see if they are really humans. I also automatically unban IP addresses after a set time frame. Please read the additional considerations below.

I'm not going to talk about the script too much other than how it operates. Many people will use different platforms and programming languages. The above-described system could be coded for almost any platform, with almost any language, so I won't talk about the actual code. You should, however, consider the following:

Additional Considerations:
I'm not going to talk too much about these, but you should be aware of the following if you choose to use the above-described system:

  1. IMPORTANT: You simply must have some kind of feature on the script that will let a user unban themselves. Inevitably, a minute percentage of false positives will occur, and you don't want that visitor to be turned away forever. The way to accomplish this is by having some kind of a challenge response if the user's IP does get banned. I use a custom Captcha system which allows the user to unban their IP. Human users can perform the unban and continue surfing whereas the automated bad bots cannot.
  2. Some users will visit your site through multi-IP proxies (AOL is a good example), so you need to account for hits from multiple IPs, otherwise your ban program may get confused.
  3. Browser pre-fetch mechanisms are likely to give a false positive on this system, since they will follow the blank URL, so looking for the pre-fetch header is something you might want to do. However, be aware that this header could also be sent by a bad bot, so don't trust it too much. It's a little more trustworthy than the User-Agent field.
  4. There can be freak occurrences where a hit is made to the php script by a legit visitor. So, you may want to think about running multiple instances of the system in parallel, wherein a ban only occurs if hits are made to all of them consecutively. If you do choose to work on a system like this, one thing to look at might be the number of seconds between the hits, because it's only automated systems that will make multiple hits in very short time-frames (1 or 2 seconds).
  5. Spam bots change IP addresses every now and then because they do eventually get blacklisted. When that happens, the old IP address gets reassigned to other web surfers that are not necessarily bad, so only block IP addresses for a maximum amount of time (I go for 1 month).
  6. It is also possible to perform an "IP Whois" on an IP address to find the NetBlock owner, which would give you some extra details, in addition to a simple reverse DNS lookup. This could be helpful for making a ban decision.
  7. Sometimes search engines have IPs that resolve to domain names other than what would seem intuitive. You need to be aware of these so that they don't get banned. These are usually found through trial and error.
  8. You may also wish to consider adding the entire C-Class IP block to the ban list, since it is possible that the bad bot will use multiple IPs from within the same range, and you certainly do want to prevent those from grabbing your content.
  9. You will need to use a database to keep track of all the bans, IPs, unbans, times, requested and referring URLs, etc... I use MySQL with about 15 fields (yes, there are that many parameters to track, when you think about it!!).
  10. You must keep the script under close observation during its test run. You will need to be "tailing" and "grepping" log files to keep an eye on your script and what it is doing.

This all sounds like a relatively easy process, and for the most part it is, but it is very, very tedious. You simply must try to think outside the box when you are implementing the logistics of this program. I haven't quite told you everything, but you do now have enough information to detect and block bad bots.

I wanted to make the script that I use available to the public (for free, of course). The unfortunate reality is that it would take too much effort to tidy up the code and write documentation, so I'm going to have to refrain from doing that for now, unless I get enough requests for it.

I hope that this sheds some light on how to block bad bots, because you really need to protect your content from thieves - and email harvesters too! Your thoughts and feedback are welcome.

Search Engine Optimization FAQ

July 11 2007, 4:05pm
0 Comments

This is actually quite an old document, but may be of interest to some...

NOTE: THIS DOCUMENT IS VERY OLD [2002] AND CONTAINS A PLETHORA OF GRAMMATICAL AND SPELLING ERRORS. IT HAS BEEN FLAGGED FOR REVISION, BUT IN THE MEANTIME, READ AT YOUR OWN RISK.

1. What is SEO (Search Engine Optimization)?
2. Why is search engine traffic so important?
3. What are META tags?
4. What are spiders / crawlers / robots?
5. Is optimization a difficult task?
6. How do I go about optimizing my website?
7. How long will it take before I start ranking better?
8. Will virtual / name based hosting be an issue?
9. Should I use hidden text on my website to help rank better?
10. Will using dynamic pages harm my rankings?
11. Do filenames and directory structure matter?
12. What is a "reciprocal link"?
13. What is "PageRank (PR)"?
14. What does "cross linking" mean?
15. What are "inbound links"?
16. How can I find how many inbound links I have?
17. How important are inbound links?
18. How can I tell if my site has been banned?
19. What is "Cloaking"?
20. Does PageRank™ affect my rankings in Google?
21. What are "doorway pages" and should I use them?
22. What is unethical SEO?
23. I use framesets, is that OK?
24. Can I use JavaScript on my website?
25. Is it OK to use Flash?


What is SEO (Search Engine Optimization)?
Search Engine Optimization (abbreviated as SEO) is the process of designing, writing, coding, programming, and scripting your entire Web site so that there is a good chance that your Web pages will appear at the top of search engine queries for your selected keywords.

Why is search engine traffic so important?
A very large percentage of people looking for stuff on the internet find what they're looking for through search engines. Given that most people will only browse through 20 or 30 results in order to find what they're looking for, it is important that your website ranks highly; otherwise you risk losing a vast amount of traffic to competitors.

What are META tags?
META tags are a part of the HTML code which is viewable only to those that are specifically looking for it. The META tags cannot be seen directly through a browser such as Internet Explorer or Netscape. These META tags can be used in order to supply information to spiders / crawlers or other robots that are looking for alternate information. These tags may present information such as keywords, site description, title, etc. Some search engines no longer use these tags, but some still do.

What are spiders / crawlers / robots?
The terms "spiders", "crawlers" and "robots" all refer to a computer program which is designed to browse websites and download information. The information that it usually downloads is the HTML source code. In the case of search engines, this source code is stored in a large database and later analyzed.

Is optimization a difficult task?
No, but it requires a lot of work and patience. The work part involves making the changes to your HTML code manually, which can take a while. The patience part involves nothing more than waiting and monitoring the results of your optimization in search engines.

How do I go about optimizing my website?
Firstly, you need to have the correct tools at hand. Such tools would include keyword density analyzers, server header checkers, meta tag analyzers, spider simulators, etc. Such tools can be found in abundance on the internet. Check out Search Engine World for some tools which are of vital importance.
Next you will need to decide exactly which search terms you wish to optimize your site for. It is extremely difficult to optimize a site for more than 3 or 4 search terms. Do some research with some keyword suggestion tools and try to discover which search terms both currently bring you the most traffic, and which ones are most used by web surfers.
The exact optimization procedure varies by search engine and it changes over time. For the latest tips on optimization for any given search engine, check out the SEO Chat Forums. After you find out what changes you need to make, employ those changes to your site / HTML code.
More often than not, changes will need to be made to the directory structure of a website, along with changes to filenames and resources.

How long will it take before I start ranking better?
Generally speaking, the search engines update once a month. However, because the information presented in a given update was crawled a few weeks earlier, it can take as long as 6 to 8 weeks before changes to your site are reflected in search engines.
Some search engines, such as Google, are making some rapid changes in technology which allow them to update their database faster, sometimes within 24 hours. Sometimes you will be lucky enough to get a quick update, and sometimes not.

Will virtual / name based hosting be an issue?
No. Almost every search engine spider is capable of handling multiple domains with the same IP address. Although having a unique IP address is beneficial for some reasons, there are no serious disadvantages to using virtual hosting.

Should I use hidden text on my website to help rank better?
The use of hidden text, or any form of text that is only intended to be viewed by spiders while being kept invisible to users is strongly discouraged. Many websites in the past have been banned from search engines for using hidden text. Many search engines consider this to be "unethical" and it can easily be detected, so avoid using it.

Will using dynamic pages harm my rankings?
The use of straightforward dynamic pages, as in using the extensions .php, .asp, etc., will not harm rankings. Using "query strings" in the URLs will, however, have a negative effect on rankings. The use of "?param=yes&other=no" will make it extremely difficult for search engines to crawl your website - so difficult, in fact, that most search engines will refuse to even crawl the website in order to conserve bandwidth and other expensive resources.

Do filenames and directory structure matter?
Many say that filenames do not directly matter; I have proven in the past, however, that they are part of the ranking algorithm in many major search engines. For optimal performance I suggest that filenames should be no longer than 15-20 characters in length and should contain the keywords for the content of the page, with words being delimited with dashes "-". Directory structure is important too, with directory names including at least one target keyword, and the directory structure should be kept as close to the root as possible, i.e. www.site.com/page.htm is much better than www.site.com/folder/folder2/page.htm.

What is a "reciprocal link"?
Let's imagine that there are 2 websites, one is "Site A" and the other "Site B". If Site A provides a link to Site B and Site B simultaneously provides a link to Site A, then the 2 respective links are known as "reciprocal" links, given that each site links to the other.

What is "PageRank (PR)"?
PageRank™ is actually a trademarked term. The trademark belongs to Google Inc. and refers to the "rank" of a website as determined by Google's ranking system. The PageRank (PR) of a website is determined by the number of inbound links a website has. The more inbound links that exist, the higher the PageRank, generally speaking that is!

What does "cross linking" mean?
Crosslinking is the term used to describe a group of sites that provide links to each other for the purpose of increasing link popularity. The sites are sometimes owned by one person or firm and have little or no other purpose than to be part of the link farm.

What are "inbound links"?
Inbound links are nothing more than links coming to a given website from an external website. For example, any links on this website to Google.com are considered to be inbound links to Google.

How can I find how many inbound links I have?
In reality there is no real way to calculate the exact number of inbound links. The best way to find inbound links to your website is to use either Google or AllTheWeb / FAST. With Google, it is possible to search for inbound links by using the search query "link:www.example.com"; this will search for all links to the site www.example.com.
On AllTheWeb, you can check for inbound links to a site by using "link.all:www.example.com".

How important are inbound links?
Over the past 2 years or so, search engines have placed a lot of importance on inbound links. Inbound links are a good way to judge the popularity of a website, i.e. the more inbound links a website has, the more important it is considered to be. Many people use words that describe a website in order to link to it, so the words used in the link have become very important. Ideally, all of the inbound links to your website should use your target keywords as the link text.

How can I tell if my site has been banned?
There is no real way to tell if your site has been banned from a search engine. If your site was listed and then suddenly disappears, it might be reasonable to assume the site was banned. Normally only sites that engage in some kind of spamming or poor practices get banned. For example, if you have used hidden text, URL cloaking, or been part of a link farm, then it is possible your site would have been banned. There is not usually a way to get re-indexed once a site is banned.

What is "Cloaking"?
Some sites have decided to cheat the search engines and deliver optimized content to the search spiders yet deliver different content to regular users. This is usually done when the site wants to use frames, Java, Flash or some other such media that would be difficult for search engines to read. The page is said to be "cloaked" whenever this optimized content is delivered to search engines.

Does PageRank™ affect my rankings in Google?
In short, yes, PageRank does affect how your website ranks in Google. The PageRank™ system was designed to be yet another factor in how pages should rank; it is not the only thing that matters, however. When you do a search at Google, the highest PR websites are not always placed at the top. Sometimes a page with a lower PR can outrank a higher PR website if it is better optimized for the search terms.

What are "doorway pages" ans should I use them?
Doorway pages are nothing more then pages full of spam text which are supposed to be optimized for search engines. Some times there are hundreds of doorway pages hosted on external servers, at different domains etc.. These doorway pages have links to the main websites. The theory is that if these doorway pages rank highly in search engines and visitors use them, they will eventually be taken to the main websites.

What is unethical SEO?
Unethical SEO is a term that is used to describe search engine optimizers, or the optimization of a website which uses any form of spamming or tricks in order to falsely improve rankings. For example, the use of hidden text or doorway pages as mentioned above would be considered unethical search engine optimization.

I use framesets, is that OK?
Frames are extremely difficult to crawl for some search engines. There are a few things which you can do in order to help the crawling of frames. Be sure that you use all of the correct meta tags and that you have a well optimized title tag on the page. Use the noframes tag and place a decent level of text content in there, along with links to other pages on your website. This will help engines that read the noframes tag to scan your website.

Can I use JavaScript on my website?
Given that JavaScript is a somewhat dynamic programming language, i.e. it can require human interaction or cause things to change, it is not readable by most of the search engines. Although it is a somewhat simple language, it has proven too difficult to parse. Any text or link in JavaScript may be missed by search engines. Only use JavaScript where it is really required.

Is it OK to use Flash?
As with JavaScript, Flash is too difficult to parse. Some sites use Flash introductions with a skip button. Pages with Flash intros greatly reduce their rankings and are encouraged to stop using intros. Sites that use Flash for navigation and interaction, etc., are encouraged to use HTML navigation and HTML text where possible.

Darrin J. Ward Speaking on SEO at Webmaster Jam Session 2007

July 11 2007, 2:54pm
0 Comments

By Darrin J. Ward:

Last year, I was lucky enough to be invited to the Webmaster Jam Session Conference as a speaker. I guess they didn't hate me, because I've been invited back to speak at the 2007 conference, the 2nd of what is well on its way to becoming an annual event.

This year, I'd like to hear from some of the people that plan on attending, and about what they'd like to hear discussed, in regards to SEO anyway. One hour is never enough to cover absolutely everything [though we all do our best], so if there's more interest in some areas versus others, then we'll adjust the presentation according to the feedback. Please use the Contact Form (select "General Contact") and let me know. I promise to return your emails.

FYI: You can find the 2006 presentation slides right here (I spoke twice last year):

  1. Promotion Methods (Presented Sep 22nd 2006)
  2. Inside Search Engines (Presented Sep 23rd 2006)

Xerox's Subsidiary PARC and Powerset: Natural Language Processing Deal

February 12 2007, 7:06pm
0 Comments

PARC (Palo Alto Research Center, Inc.), a subsidiary of Xerox, has inked a deal with search engine startup Powerset, which allows for the use of PARC's "Natural Language Processing" technology.

According to this article, Barney Pell (Powerset's CEO) believes that the technology will enable computer algorithms to return more relevant results for search queries. The gist of the technology appears to be that it can more closely analyze the input/queries and find relevant results/answers.

Question is: do the majority of surfers actually enter natural language, or have they been trained into entering just keywords? Ask Jeeves tried to get users to enter entire questions, but that didn't work out so well. And, if users don't actually enter natural language as queries, how will that affect the results of the natural language processing system?

Anyway, I've given up on believing that new market entrants can take on Google. Now, I just sit back and wait and see before getting excited. You are encouraged to do likewise.

Redirecting WWW to non-WWW domains (or vice versa)

January 7 2007, 11:28pm
0 Comments

So, first of all, why would anybody want to redirect a domain such as example.com to www.example.com? Well, many people feel that it allows for consolidation of in-link effects. And, technically, www is a subdomain of example.com. They are not the same, even if they may serve identical content. For example, if some people link to your site with example.com and others use www.example.com, then technically there are 2 different versions of your website. By performing a redirect, we tell search spiders that only one version of the domain should be used.

In performing this redirect, we prevent duplicate content from being a problem (www vs. non-www) and we also "combine" the power of the in-links, so we effectively have one domain at optimal performance instead of being split over two domains.

The way to accomplish this redirect is with mod_rewrite, an amazingly useful Apache Web-Server Module. The following pieces of code can be placed in the .htaccess file or the httpd.conf file. Personally, I use the httpd.conf because it doesn't have the performance hit of .htaccess, though that hit is so small that it's almost immeasurable. If you don't have a dedicated server then .htaccess is likely your only option.

To redirect Non-WWW to WWW:

RewriteEngine On
RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
RewriteRule ^.*$ http://www.example.com%{REQUEST_URI} [R=301,L]

To redirect WWW to non-WWW:

RewriteEngine On
RewriteCond %{HTTP_HOST} ^www\.example\.com$ [NC]
RewriteRule ^.*$ http://example.com%{REQUEST_URI} [R=301,L]

Please note that the above code will also carry over the requested page, so the page that was being requested will be redirected to its new form: http://example.com/page.html will get redirected to http://www.example.com/page.html, and vice versa.

SEO Interviews

August 1 2006, 7:37am
0 Comments

The 14th Colony has interviewed various SEOs, including myself. You can read the interviews and/or my interview.