Barry Schwartz over at the Search Engine Roundtable reminds us today that you should use rel=nofollow on your affiliate links, or else you may receive a Google penalty.
The inherent illogic of stuff like this makes my blood boil sometimes... Why would Google keep content from ranking just because its links to affiliates or other sites do not use rel=nofollow? Either the content on the page is useful and deserves to rank, or it doesn't. I don't see why links lacking rel=nofollow should, on their own, be a determining factor in that decision. Using rel=nofollow is a technicality.
If Google determines that the links on a page are against their paid-linking policy, then they should just discount any "link juice" that might get passed on from them. That's something they could do transparently in the background, without having to force Webmasters to consider this ridiculous rel=nofollow attribute, and without having to deprive searchers of valuable content (assuming Google otherwise determined it to be valuable except for the non-rel=nofollow affiliate links).
Alas, the Google insidiousness continues, and we continue to begrudgingly comply forthwith so that we may get some rankings love! Although the whole thing does remind me of the Pied Piper sometimes :)
If you've been in the Internet Marketing industry for any length of time, then you will have heard of the "ODP" or "DMOZ", the Open Directory Project that resides at www.dmoz.org. The ODP is a large general Web directory edited by volunteers. And for years it was considered almost the holy grail for inbound link developers. Some still consider it to be so.
A member at WebmasterWorld asks "Is DMOZ still relevant in 2009?". The responses are interesting.
As part of our SEO campaigns, we do perform directory submissions to a select number of top-tier general directories and a small number of niche directories (the number depends on the niche). The ODP is still in the top 3 of our most desirable general directory link acquisition targets. But it's certainly not a holy grail of any sort.
The ODP certainly has its problems. It's very slow to get anything listed in the ODP because there are too few editors/volunteers for the volume of submissions they receive. Internet users seem to be moving away from directory-type Websites and converging on social/search-type sites. And the ODP hasn't done anything even remotely innovative in years (in fact, I don't know if they've done anything innovative, ever).
But the ODP still gets used in countless places across the Web. So a listing/link in the ODP inherently means links from many other places. The value of the ODP link itself probably carries more weight than all of the subsequent links combined, but it's still a positive.
Yep - for me submitting to the ODP is still relevant in 2009. Not as much as it used to be, certainly. But it's still relevant. I do however recommend that you read my insights on submitting to directories for SEO.
SEOmoz has published some updated SEO best practices guidelines. The guidelines are apparently based on "correlation data", which means that they looked at rankings and analyzed the different components on the ranking pages.
The list of SEO best practice items gives recommendations for:
This is great stuff, but as with everything in the "SEO" world, it needs to be taken with a pinch of salt. Each element that gets analyzed essentially introduces another unknown variable into a simultaneous equation.
One of the most interesting items is that H1 tags have been reduced to having nearly no importance in search engines. What I'm wondering is whether or not SEOmoz also looked at the CSS styling for the H1s to determine if H1s styled to a smaller font carry less weight, or if the reduced importance of the H1 applies across the board. We know that Google look at CSS and JavaScript.
Bob Heyman today on Search Engine Land notes that the Google freshness factor may have big implications for retailers. He notes that the EVP of ice.com, a large Internet retailer, is making proactive changes to their site because of the recent search "options" functionality introduced by Google that allows searchers to select "recency" as a criterion.
They are indeed correct. When you search on Google you will see a "Show Options" link at the top of the SERPs. When you click this link, you will see the "recency modifiers" options of "Any time", "Recent results", "Past 24 hours", "Past week" and "Past year". These allow searchers to refine the search results based on how recently the pages were updated.
If you sell products online then you probably don't need to update your product pages all too often. This will have a negative impact on your traffic levels if many people adopt the usage of Google's recency modifiers, because your pages that haven't been updated in a long time won't get listed in SERPs that require recently modified pages.
The first few things that come into my mind are: "daily changes", "Last-Modified", "checksum" and "page size". If you can keep all of these in mind and know how they relate to each other, then you should be able to engineer yourself into always having fresh content.
Google are looking for pages that are recently modified, so the best way to meet that criterion is to actually add new content to pages daily. Keep in mind though that they are probably looking for pages that exceed some threshold of new content before the page is actually considered changed or updated. So just adding or changing 1 sentence on a page with 100 sentences probably isn't going to cut it. I don't know what the threshold is, but I would be comfortable recommending a guideline minimum of 10-20%. This means 1 or 2 new stories every day for a page that normally features 10 stories.
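To make the 10-20% guideline concrete, here is a minimal Python sketch that measures what share of a page's stories are new compared with the previous version. The function name and the story lists are made up for illustration, and nobody outside Google knows whether they measure change this way; it is just a quick way to check your own pages against the guideline.

```python
def fraction_new(previous_items, current_items):
    """Return the share of items on the current page that were not on the previous version."""
    if not current_items:
        return 0.0
    previous = set(previous_items)
    new_items = [item for item in current_items if item not in previous]
    return len(new_items) / len(current_items)


# Hypothetical page that normally features 10 stories.
yesterday = [f"story {i}" for i in range(1, 11)]
# Today: 2 new stories at the top, the 8 most recent old ones kept.
today = ["story 11", "story 12"] + yesterday[:8]

share = fraction_new(yesterday, today)
print(f"{share:.0%} of the page is new")  # prints "20% of the page is new"
```

Two new stories out of ten lands at the top of the 10-20% range, which matches the "1 or 2 new stories every day" rule of thumb.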
I know what you're thinking... I'll add some random content and every time a search engine sees the page it will be different. I generally advise against this because if Google find that your content is completely random, then they will be a lot less confident sending traffic to you for a specific keyword, given that the relevant content that was on the page at the time they spidered it will likely be gone when a user goes to see the page. Frequent change = good. Random = bad.
So. Commit to making a few changes throughout the day and you should always be there for a "Past 24 Hours" search.
"Last-Modified" is an HTTP header which a web server sends with the response to a request. The Last-Modified header tells the client (the search engine spider in this case) when the page was last modified. It's very likely that Google and other search engines wanting to determine freshness will look for this header. However they won't completely rely on it because it can be "faked" to whatever date the Webmaster wants. So, search engines will still look for content changes. Always sending the current time is bad.
It's important to note that the Last-Modified header is not always sent by default. It is sent most of the time with static content/pages, but sites that are dynamic generally don't send this header by default due to the complexities in calculating the true last time of modification. If you're selecting a CMS, this may be a worthy consideration. Incidentally, there is also something called the "If-Modified-Since" header, which you should look into.
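For dynamic sites that don't send these headers by default, the logic is simple enough to sketch. The following Python snippet is an illustration only: the function name is hypothetical and it assumes that your CMS can tell you the true time a page was last changed. It builds the Last-Modified header and answers an If-Modified-Since request with a 304 ("Not Modified") when nothing has changed.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_response(last_modified, if_modified_since=None):
    """Return (status, headers) for a page last changed at `last_modified`.

    `last_modified` must be a timezone-aware UTC datetime;
    `if_modified_since` is the raw request header value, if any.
    """
    last_modified = last_modified.replace(microsecond=0)  # HTTP dates have 1-second resolution
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: ignore it and send the full page
        if since is not None and since.tzinfo is not None and last_modified <= since:
            return 304, headers  # unchanged since the client's copy
    return 200, headers


changed = datetime(2009, 6, 1, 12, 0, 0, tzinfo=timezone.utc)
status, headers = conditional_response(changed, "Mon, 01 Jun 2009 12:00:00 GMT")
print(status, headers["Last-Modified"])
```

Note that the date sent is the real modification time, not the current time, which is in keeping with the warning above about always sending "now".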
Finally, a quick and dirty way to check for changes to a page would be to compare the checksum values and the file sizes to previous versions of the document. I won't go into much detail here because I'm not sure that Google are using these methods, but if 2 versions of the same file pulled at different times have exactly the same size, then there is at least a small probability that they are identical.
The checksum method is more accurate, but still not perfect. A checksum comparison will compare the checksum of 2 versions of the same document, and if the checksums are identical then there is a good chance that the documents themselves are identical. This method gives a pretty accurate yes or no answer as to whether the 2 documents will be identical. It does not measure the degree to which the documents' contents differ (the percentage of content that is different).
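Here is a small Python sketch of the size and checksum comparison, purely to illustrate the idea. The choice of SHA-256 and the function names are my assumptions; I'm not suggesting that any search engine does it this way.

```python
import hashlib


def fingerprint(content: bytes):
    """Return (size in bytes, SHA-256 checksum) for one version of a document."""
    return len(content), hashlib.sha256(content).hexdigest()


def probably_unchanged(old: bytes, new: bytes) -> bool:
    """True when both the size and the checksum of the two versions match."""
    return fingerprint(old) == fingerprint(new)


version_1 = b"<p>Diamond rings from $99</p>"
version_2 = b"<p>Diamond rings from $89</p>"  # same size, different content

same_size = len(version_1) == len(version_2)
print(same_size, probably_unchanged(version_1, version_2))  # prints "True False"
```

The example shows why size alone is the weaker test: a one-character price change leaves the size identical, but the checksums differ. Neither test says how much of the page changed.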
I hope this helps to at least get you thinking about this important issue. I know that I'm using the recency modifiers quite a bit, but I don't know what the adoption numbers are; hopefully Google tells us at some point. Submit a comment or get in touch if you have something to say!
A poster at WebmasterWorld.com asks about using different anchor/link text to point to the same page for SEO.
This is a good question and I thought I'd share some quick insight as to how I normally approach link development in terms of anchor-text diversification.
The short answer is that diversity in inbound links is a good thing, because it shows that: a) the links are less likely to be auto-generated or copy/pasted everywhere, and; b) the page is relevant within a variety of slightly different contexts (presuming the same general topic).
The longer answer is that although some diversity in anchor text is a good thing, you need to be careful not to overdo it in case you dilute ranking potential for the real keywords. If a page has 100 inbound links but none of them are the same or they don't even contain the same keyword, then how will a search engine know which keywords are most relevant (besides looking at on-page content)? The best thing to do is to make sure that at least 50-60% of inbound links contain the root keyword - or one of its closely stemmed variants or synonyms - in combination with other words. The rest of the links can be whatever, but ideally there would be some consistent phrase usage too.
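If you keep a list of the anchor text used in your inbound links, checking the 50-60% guideline takes only a few lines. This Python sketch is illustrative: the function name, the sample anchors and the simple substring matching (which stands in for real stemming) are all assumptions.

```python
def keyword_share(anchor_texts, keyword_variants):
    """Return the share of anchor texts that contain any of the keyword variants."""
    if not anchor_texts:
        return 0.0
    variants = [variant.lower() for variant in keyword_variants]
    hits = sum(
        1 for text in anchor_texts
        if any(variant in text.lower() for variant in variants)
    )
    return hits / len(anchor_texts)


# Hypothetical inbound link anchors for a page targeting "widget".
anchors = [
    "blue widgets",
    "cheap blue widget store",
    "click here",
    "Widget reviews",
    "example.com",
]

share = keyword_share(anchors, ["widget"])
print(f"{share:.0%} of inbound anchors contain the root keyword")  # prints "60% ..."
```

Here three of the five anchors contain the root keyword, so the page sits at the top of the suggested range while still showing some diversity.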
The other thing to consider is the positioning of the link itself, and whether or not it's an internal link or a link from an external site. If the link is from a global navigation menu, then it's not practical (or good) to make that link different on every page just for SEO (plus, UX people would scream at you). Also, if the links are internal, then I think the tolerance to lower diversity in link text is higher (the same people will probably use the same descriptions multiple times). Considering links from multiple external sites, it makes sense to think that there should be higher diversity in link text because there are probably multiple writers involved, and no 2 writers will do exactly the same thing.
by Darrin J. Ward:
A big thank you to everyone that came to the Webmaster Jam Session this year in Dallas, TX. Although I got a lot less rest than I would have liked, I can truly say that I had a fantastic time.
I had one or two requests for the slides from this year's Search Engine Strategies, so without any further delay, click the link below for the PDF file. Please contact me if you have any questions.
Search Engine Strategies Slides (September 22nd 2007)
by Darrin J. Ward:
I've always been greatly interested in mathematics. Well, not always, but I did come to have a lot of respect for applied mathematics and physics during my latter years of school and college. Now, I have to also admit that I don't understand as much as I'd like, because it would simply take far too much time to learn it all. The deep stuff is beyond me and I admit that. Nonetheless, I remain fascinated by the sheer logic in math and the fact that it transcends race, time, other languages, etc. It's a universal language.
Ever since I learned about Fermat's Last Theorem, I've been absolutely engrossed by the notion that a simple-looking and simple-sounding statement could boggle the minds of the world's greatest mathematicians for over 350 years. The theorem states, simply, that x^n + y^n = z^n has no solutions where x, y and z are integers greater than zero and n is an integer of value 3 or greater. You'll note that n=2 would be the Pythagorean theorem!
So, where is all of this going and how does it relate to SEO? Well, in reading the amazingly complicated Proof of Fermat's Last Theorem [PDF] by Andrew Wiles (and yes, I've actually had a printed copy in my office for the last few years), I've been forced to learn a little bit about some intriguing things in number theory. One such thing was Eigenvectors. In doing further research on these I came across a wonderful paper entitled "The $25,000,000,000 Eigenvector - The Linear Algebra Behind Google" by Kurt Bryan & Tanya Leise, which is basically about Google's PageRank (an Eigenvector).
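To give a feel for what "PageRank is an Eigenvector" means in practice, here is a minimal Python sketch of the textbook power-iteration method on a tiny made-up link graph. The 0.85 damping factor, the three-page graph and the handling of pages without outbound links are assumptions for illustration; this is not a claim about how Google actually computes anything today.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Approximate PageRank by power iteration.

    `links` maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share, then link shares are added.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            # A page with no outbound links spreads its rank over all pages.
            targets = links.get(page) or pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical 3-page web: a links to b and c, b links to c, c links to a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Repeating the same calculation until the numbers stop changing is what finds the eigenvector. In this toy graph, page c comes out on top because it collects links from both other pages.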
I've read quite a lot of academic papers that theorize on various things, but I had not come across this particular one before, so it was a pleasure to look through it. I mostly use academic papers as a source of inspiration rather than a solid foundation for an SEO campaign. They are extremely wonderful in provoking me to think about abstract things which eventually help me get ahead in the SEO world.
The fact of the matter is that search engines are nothing more than big calculators (though, with an arguable component of manual reviewing, a-la Google's Patent # 7096214). If you know how they work and understand the steps that they make in performing their calculations, then you have a significant competitive advantage. Looking at what's being proposed in these academic papers therefore makes a lot of sense as they are a great source of the latest in terms of strategies.
So, here are some of the papers that I usually recommend to people wanting to learn more. They do have a lot of mathematics in some cases, but you can usually get some good info even without understanding everything (I will update this list every so often; contact me with additional suggestions):
Authoritative Sources in a Hyperlinked Environment
-- by Jon M. Kleinberg
Site Level Noise Removal for Search Engines
-- by Andre Luiz da Costa Carvalho, Paul-Alexandru Chirita, Edleno Silva de Moura, Pavel Calado, Wolfgang Nejdl (2006)
The Anatomy of a Large-Scale Hypertextual Web Search Engine
-- by Sergey Brin and Lawrence Page
A Survey of Eigenvector Methods For Web Information Retrieval
-- by Amy N. Langville & Carl D. Meyer
ParaSite: Mining Structural Information on the Web
-- by Ellen Spertus
The $25,000,000,000 Eigenvector - The Linear Algebra Behind Google
-- by Kurt Bryan & Tanya Leise
Every serious webmaster that I know, and especially every SEO, has complained at least once about their content being stolen (scraping). Computer software known as "spiders", "robots" or "bots" regularly crawls the internet and visits our websites. Search engines use this spidering software to visit your website and include your pages in their indices. Unfortunately, people that are out to steal your content also use this software to download websites in huge volumes and then republish them elsewhere, thus detracting from the uniqueness of the "scraped" content. Or, some robots just relentlessly rip through pages in an attempt to harvest as many email addresses as possible - so that they can spam them later. The way to stop either these scrapers or email harvesters is the same. Update: SpyderTrax is a tool that automatically bans bad robots and tracks good ones.
Obviously, we don't mind the search engines accessing our website because they are "good" bots, but how do we prevent the content-robbing "bad bots" from accessing our websites?
Many people will tell you about something called mod_rewrite in the ".htaccess" file, which is an Apache directives file. Many of the directives/code which you will find on the internet use a very simple filtering system to prevent known-bad robots from accessing your content. The problem is that almost all of them rely on the spider-supplied "User-Agent" field. Some others rely on blocking known-bad IP address blocks. But none of them allow you as a webmaster to dynamically detect and block these bad robots.
First, let me give you a brief summary of all of the information that we will have available to us in order to make a sound judgement.
I have been using a proprietary script for some time that attempts to solve the bad robots problem. Today, I'll share with you the logic that I have implemented, which has demonstrated great success. I know of a few other people that also use the following method with some variation. In fact you may be able to find a script or service that does this for you.
A brief overview of the steps that I take to detect scraping "bad bots" is (also see additional considerations):
So, here is some more explanation about the procedure:
I have coded a hidden link into all of my web pages. Why? Because robots/spiders follow almost all links, and I want those spiders, not ordinary users, to follow this special hidden link, so I made it invisible. By invisible, I mean that it has no anchor text. Note that the link points to a php file and that I have used the rel=nofollow attribute (to prevent it from being followed by spiders that understand that command).
It is logical to assume that the majority of hits to this php file will be culpable, though we cannot yet accurately form an opinion as to whether the hit is from a good robot, a bad "bot" or some other source (such as a browser pre-fetch). However, we can use the php file to try and help us out, since only automated requests should be made to this file.
So, again, what information do we know from the hit to the php file? Remember, for each hit we have an IP address, requested URL and the User Agent. We can't really use the User-Agent field, because it is supplied by the requesting agent, and thus potentially mendacious. We can trust the requested URL because it is logged on our side (but we've already exhausted this to our advantage by detecting the hit to robots.php). We can also trust the IP address.
Yes, IPs can be faked/spoofed, but the majority of scrapers are small time and don't have the resources or know-how to perform IP spoofing (which is actually very difficult to do over TCP/IP to websites).
OK, so let's look at the IP. If we suspect the request is suspicious (which it is since it followed the invisible link to the php file), then we need to determine whether or not the IP can be trusted, i.e. if it's a good or bad bot, and that can be done via a reverse DNS lookup. A reverse DNS lookup tries to translate an IP address into the domain name that it belongs to. Here is an example of some IP addresses (taken directly from my log file) and the domain names to which they correspond:
So, we found that the first IP address does actually belong to Google, but the second one doesn't. We immediately ban the second IP address by adding it as a "deny from" in the .htaccess file, because it is crawling our site and downloading our content, as evidenced by its request to the php file in the blank link.
We don't know what that domain is and we don't want to trust it with unrestricted access to our content. When I add an IP address to my list, I regenerate my .htaccess ban file immediately, which ensures the bad bot is banned immediately.
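Regenerating the ban file from a list of IP addresses could look something like this Python sketch. It is not the script I use; the function name is hypothetical, and the "Order/Allow/Deny" directives are the Apache 2.2-style access syntax, so check what your own server version expects.

```python
import ipaddress


def render_ban_rules(banned_ips):
    """Build the access-control section of an .htaccess file from banned IPs."""
    # Validate and normalize each address so that a malformed entry
    # can never be written into the server configuration.
    addresses = sorted({str(ipaddress.ip_address(ip)) for ip in banned_ips})
    lines = [
        "# Auto-generated ban list",
        "Order Allow,Deny",
        "Allow from all",
    ]
    lines += [f"Deny from {address}" for address in addresses]
    return "\n".join(lines) + "\n"


# Hypothetical banned addresses (documentation-range IPs), one listed twice.
rules = render_ban_rules(["203.0.113.7", "198.51.100.4", "203.0.113.7"])
print(rules)
```

The output can then be written to the ban file each time the list changes. Rejecting anything that isn't a valid IP address matters here, because a typo in .htaccess can take the whole site down.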
At this point, you should realize that you will need to compile a list of trusted domain names. You should also thoroughly test your script so that you know it won't erroneously block a trusted domain.
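The trusted-domain check can be sketched in Python as follows. This is an illustration under assumptions rather than my actual script: the allow-list is a placeholder you would replace with your own, and the extra forward lookup (confirming that the hostname resolves back to the same IP) is an addition of mine to guard against faked reverse DNS records. The lookups are passed in as parameters so that the logic can be tested offline with made-up records, as suggested above.

```python
import socket

# Hypothetical allow-list of trusted crawler domains; compile and maintain your own.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com")


def is_trusted_bot(ip, trusted_suffixes=TRUSTED_SUFFIXES,
                   reverse_lookup=socket.gethostbyaddr,
                   forward_lookup=socket.gethostbyname_ex):
    """Return True only if the IP reverse-resolves to a trusted domain
    and that hostname resolves forward to the same IP."""
    try:
        hostname = reverse_lookup(ip)[0].lower().rstrip(".")
    except OSError:
        return False  # no reverse DNS record: cannot be verified
    if not hostname.endswith(tuple(trusted_suffixes)):
        return False
    try:
        addresses = forward_lookup(hostname)[2]
    except OSError:
        return False
    return ip in addresses


# Offline demonstration with made-up records (documentation-range IPs),
# so the logic can be tested without touching real DNS.
FAKE_PTR = {
    "192.0.2.10": "crawl-1.googlebot.com",
    "192.0.2.20": "host.scraper.example",
    "192.0.2.40": "fake.googlebot.com",  # spoofed reverse record
}
FAKE_A = {
    "crawl-1.googlebot.com": ["192.0.2.10"],
    "fake.googlebot.com": ["198.51.100.9"],
}


def fake_reverse(ip):
    if ip not in FAKE_PTR:
        raise OSError("no PTR record")
    return (FAKE_PTR[ip], [], [ip])


def fake_forward(hostname):
    if hostname not in FAKE_A:
        raise OSError("unknown host")
    return (hostname, [], FAKE_A[hostname])


for ip in ("192.0.2.10", "192.0.2.20", "192.0.2.40"):
    print(ip, is_trusted_bot(ip, reverse_lookup=fake_reverse, forward_lookup=fake_forward))
```

Only the first address passes. The second belongs to an unknown domain, and the third claims a trusted name that doesn't resolve back to it.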
I'm going to leave the logistics of the unban up to you. I present my banned users with a captcha challenge to see if they are really humans. I also automatically unban IP addresses after a set time frame. Please read the additional considerations below.
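The timed unban is the easy part. As a rough Python sketch (the 24-hour ban length and the names are assumptions, so pick whatever time frame suits you), you store the time of each ban and drop the entries that have expired before regenerating the ban file:

```python
import time

BAN_SECONDS = 24 * 60 * 60  # hypothetical ban length: 24 hours


def active_bans(bans, now=None, ban_seconds=BAN_SECONDS):
    """Return only the bans that have not yet expired.

    `bans` maps each IP address to the Unix timestamp at which it was banned.
    """
    now = time.time() if now is None else now
    return {ip: banned_at for ip, banned_at in bans.items()
            if now - banned_at < ban_seconds}


# Hypothetical ban list: one ban from 2 hours ago, one from 3 days ago.
current_time = 1_250_000_000
bans = {
    "203.0.113.7": current_time - 2 * 60 * 60,
    "198.51.100.4": current_time - 3 * 24 * 60 * 60,
}
print(sorted(active_bans(bans, now=current_time)))  # prints "['203.0.113.7']"
```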
I'm not going to talk about the script too much other than how it operates. Many people will use different platforms and programming languages. The system described above could be coded for almost any platform, with almost any language, so I won't talk about the actual code. You should however consider the following:
Additional Considerations:
I'm not going to talk too much about these, but you should be aware of the following if you choose to use the system described above:
This all sounds like a relatively easy process, and for the most part it is, but it is very tedious. You simply must try to think outside the box when you are implementing the logistics of this program. I haven't quite told you everything but you do now have enough information to detect and block bad bots.
I wanted to make the script that I use available to the public (for free of course). The unfortunate reality is that it would take too much effort to tidy up the code and write documentation, so I'm going to have to refrain from doing that for now, unless I get enough requests for it.
I hope that this will shed some light on how to block bad bots, because you really need to protect your content from thieves, and from email harvesters too! Your thoughts and feedback are welcome.
This is actually quite an old document, but may be of interest to some...
NOTE: THIS DOCUMENT IS VERY OLD [2002] AND CONTAINS A PLETHORA OF GRAMMATICAL AND SPELLING ERRORS. IT HAS BEEN FLAGGED FOR REVISION, BUT IN THE MEANTIME, READ AT YOUR OWN RISK.
1. What is SEO (Search Engine Optimization)?
By Darrin J. Ward:
Last year, I was lucky enough to be invited to the Webmaster Jam Session Conference as a speaker. I guess they didn't hate me, because I've been invited back to speak at the 2007 conference, the 2nd of what is well on its way to becoming an annual event.
This year, I'd like to hear from some of the people that plan on attending, and about what they'd like to hear discussed, in regards to SEO anyway. One hour is never enough to cover absolutely everything [though we all do our best], so if there's more interest in some areas versus others, then we'll adjust the presentation according to the feedback. Please use the Contact Form (select "General Contact") and let me know. I promise to return your emails.
FYI: You can find the 2006 presentation slides right here (I spoke twice last year):
PARC (Palo Alto Research Center, Inc.), a subsidiary of Xerox, has inked a deal with search engine startup Powerset, which allows for the use of PARC's "Natural Language Processing" technology.
According to this article, Barney Pell (Powerset's CEO) believes that the technology will enable computer algorithms to return more relevant results for search queries. The gist of the technology appears to be that it can more closely analyze the input/queries and find relevant results/answers.
Question is: do the majority of surfers actually enter natural language, or have they been trained into entering just keywords? ASK Jeeves tried to get users to enter entire questions, but that didn't work out so well. And, if users don't actually enter natural language as queries, how will that affect the results of the natural language processing system?
Anyway, I've given up on believing that new market entrants can take on Google. Now, I just sit back and see before getting excited. You are encouraged to do likewise.
So, first of all, why would anybody want to redirect a domain such as example.com to www.example.com? Well, many people feel that it allows for consolidation of in-link effects. And, technically, www is a subdomain of example.com. They are not the same, even if they may serve identical content. For example, if some people link to your site with example.com and others use www.example.com, then technically there are 2 different versions of your website. By performing a redirect, we tell search spiders that only one version of the domain should be used.
In performing this redirect, we prevent duplicate content from being a problem (www vs. non-www) and we also "combine" the power of the in-links, so we effectively have one domain at optimal performance instead of being split over two domains.
The way to accomplish this redirect is with mod_rewrite, an amazingly useful Apache Web-Server Module. The following pieces of code can be placed in the .htaccess file or the httpd.conf file. Personally, I use the httpd.conf because it doesn't have the performance hit of .htaccess, though that hit is so small that it's almost immeasurable. If you don't have a dedicated server then .htaccess is likely your only option.
To redirect Non-WWW to WWW:
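RewriteEngine On
# Match any host other than www.example.com (dots escaped, case-insensitive)
RewriteCond %{HTTP_HOST} !^www\.example\.com [NC]
# Redirect permanently (301), keeping the requested path
RewriteRule ^.*$ http://www.example.com%{REQUEST_URI} [R=301,L]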
To redirect WWW to non-WWW:
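RewriteEngine On
# Match requests for the www host (dots escaped, case-insensitive)
RewriteCond %{HTTP_HOST} ^www\.example\.com [NC]
# Redirect permanently (301), keeping the requested path
RewriteRule ^.*$ http://example.com%{REQUEST_URI} [R=301,L]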
Please note that the above code will also carry over the requested page, so the page that was being requested is redirected to the same path on the other host. http://example.com/page.html will get redirected to http://www.example.com/page.html and vice-versa.
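If it helps to see the mapping spelled out, this small Python sketch mimics the decision that the rules make for each request. It only illustrates the logic and is not a replacement for mod_rewrite; the function name is made up, and it ignores details such as port numbers and HTTPS.

```python
def canonical_redirect(host, request_uri, canonical_host="www.example.com"):
    """Return the 301 target URL, or None when the host is already canonical."""
    if host.lower() == canonical_host.lower():
        return None  # already on the preferred host: serve the page normally
    return f"http://{canonical_host}{request_uri}"


# Non-WWW to WWW
print(canonical_redirect("example.com", "/page.html"))
# WWW to non-WWW
print(canonical_redirect("www.example.com", "/page.html", canonical_host="example.com"))
```

The first call returns http://www.example.com/page.html and the second returns http://example.com/page.html, with the path preserved in both directions.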
The 14th Colony has interviewed various SEOs, including myself. You can read the interviews and/or my interview.