Since Google just published their Web Authoring Statistics, I thought I’d look back at my findings from a few years ago, when I looked at markup validity and the usage of things like Flash and JavaScript in a small (very small compared to Google’s dataset) random sample of pages. The results, which utilise such complex analytical techniques as adding up numbers and drawing bar charts, are presented here.
My data was collected via a random web crawl which was carried out between 28th April 2003 and 29th July 2003. It consists of 34,157 pages.
From the sampled pages it was found that 94% contained at least one img tag. One page (http://applyresults.about.com/bkey.htm) contained well over 2,000 occurrences of <img width="1" height="1">. Perhaps this is some kind of nasty layout hack? The page certainly didn’t seem to believe in CSS for layout. The median number of img tags on a page was 21 and nearly 30% of the pages contained between 1 and 10 img tags.
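The img figures above come down to counting start tags per page and summarising the counts. A minimal sketch of that idea, assuming Python's standard `html.parser` rather than whatever the original crawler used; the `pages` list is a made-up stand-in for the crawled documents:

```python
from html.parser import HTMLParser
from statistics import median


class ImgCounter(HTMLParser):
    """Counts <img> start tags in one HTML document."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag names, so <IMG> is counted too.
        if tag == "img":
            self.count += 1


def count_imgs(html):
    parser = ImgCounter()
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical sample; the real data set had 34,157 pages.
pages = [
    "<html><body><p>No images</p></body></html>",
    '<body><img src="a.gif"><IMG SRC="b.gif"></body>',
    '<body><img width="1" height="1">' * 3 + "</body>",
]
counts = [count_imgs(p) for p in pages]
share_with_img = sum(1 for c in counts if c > 0) / len(counts)
median_imgs = median(counts)
```

A tolerant parser matters here, since most of the pages in a sample like this are not valid markup.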
Over 97% of pages in the sample used the title tag. Back in 1995/96 Tim “XML” Bray found that 87.5% of pages had a title. The apparent slight upturn in the usage of the title could be the result of a few things, for example:
- HTML “guides” advocate using the title tag

25,000 of the pages in the sample were examined to see if they made use of JavaScript, VBScript or Flash. Not surprisingly, JavaScript was by far the most common, appearing on nearly 61% of the pages checked. Interestingly, all the pages that used VBScript were also using JavaScript, and the sole use for VBScript on the pages in the sample appeared as part of a Flash detection script.
| | Number of Pages | Percentage of Pages |
|---|---|---|
| JavaScript | 15,214 | 60.86% |
| VBScript | 222 | 0.89% |
| Total Pages Using Scripting | 15,214 | 60.86% |
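One way a check like this could be done is to look at the `language` and `type` attributes of each script tag. The sketch below is an assumption about the approach, not the method actually used for the table: it ignores inline event handlers, and it treats any script tag that does not declare VBScript as JavaScript.

```python
from html.parser import HTMLParser


class ScriptLanguages(HTMLParser):
    """Collects the scripting languages declared on <script> tags."""

    def __init__(self):
        super().__init__()
        self.languages = set()

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        attr = dict(attrs)
        # Valueless attributes arrive as None, hence the "or ''".
        declared = f"{attr.get('language') or ''} {attr.get('type') or ''}".lower()
        if "vbs" in declared:
            self.languages.add("VBScript")
        else:
            # Assumption: an undeclared or unrecognised language is JavaScript.
            self.languages.add("JavaScript")


def script_languages(html):
    parser = ScriptLanguages()
    parser.feed(html)
    parser.close()
    return parser.languages


# Hypothetical page resembling a Flash detection script.
sniffer = (
    '<script language="VBScript">Dim x</script>'
    '<script type="text/javascript">var y;</script>'
)
found = script_languages(sniffer)

# The percentages in the table are simple shares of the 25,000 pages checked.
javascript_pct = round(100 * 15214 / 25000, 2)
vbscript_pct = round(100 * 222 / 25000, 2)
```

Because every VBScript page also used JavaScript, the total for scripting equals the JavaScript row.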
With Flash, I searched for the usual components of the default Macromedia markup: the IE Win specific object tag (including classid) and the embed tag.
About 1.9% of pages in the sample were found to be using at least one of these tags. Not surprisingly, the vast majority of the pages using Flash contained both of these tags. All but one of these pages had an embed tag; the only page not to use embed (hence providing the Flash content only to Internet Explorer) was found on MSN.
| | Number of Pages | Percentage of Pages |
|---|---|---|
| Flash Object Tag | 422 | 1.69% |
| Flash Embed Tag | 476 | 1.90% |
| Total Pages Using Flash | 477 | 1.91% |
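A rough sketch of how the two Flash tags might be detected. The exact matching rules used for the table are not given, so the ones here are assumptions: the object tag is matched on the Flash ActiveX classid, and the embed tag on either the Flash MIME type or a `.swf` source.

```python
from html.parser import HTMLParser

# Assumed matching criteria, not taken from the original study.
FLASH_CLASSID = "clsid:d27cdb6e-ae6d-11cf-96b8-444553540000"
FLASH_MIME = "application/x-shockwave-flash"


class FlashDetector(HTMLParser):
    """Records whether a page has a Flash object tag and/or embed tag."""

    def __init__(self):
        super().__init__()
        self.has_object = False
        self.has_embed = False

    def handle_starttag(self, tag, attrs):
        # Normalise attribute values; valueless attributes arrive as None.
        attr = {name: (value or "").strip().lower() for name, value in attrs}
        if tag == "object" and attr.get("classid") == FLASH_CLASSID:
            self.has_object = True
        elif tag == "embed" and (
            attr.get("type") == FLASH_MIME
            or attr.get("src", "").endswith(".swf")
        ):
            self.has_embed = True


def detect_flash(html):
    """Returns (has_object, has_embed) for one page."""
    parser = FlashDetector()
    parser.feed(html)
    parser.close()
    return parser.has_object, parser.has_embed


# Hypothetical markup in the style of the default Macromedia output.
both = (
    '<object classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000">'
    '<param name="movie" value="movie.swf">'
    '<embed src="movie.swf" type="application/x-shockwave-flash"></embed>'
    "</object>"
)
object_only = '<object classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000"></object>'
```

A page counts towards the total if either flag is set, which is why 477 is only one more than the embed count.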
Of the pages in my sample an encouraging 15,313 contained a DOCTYPE (approximately 45%). However, do these pages follow the standard as they claim? I was able to validate 14,125 of these documents and the following results were obtained.
| | Number of Pages | Percentage of Pages |
|---|---|---|
| Valid | 973 | 6.89% |
| Invalid | 5,966 | 42.24% |
| Errors Prevent Validation | 7,186 | 50.87% |
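Before any validation can happen, a page's DOCTYPE has to be found. As an illustration only (the original tooling is not described), Python's `html.parser` passes declarations to `handle_decl`, which is enough to pull out the DOCTYPE text:

```python
from html.parser import HTMLParser


class DoctypeFinder(HTMLParser):
    """Remembers the first DOCTYPE declaration seen in a document."""

    def __init__(self):
        super().__init__()
        self.doctype = None

    def handle_decl(self, decl):
        # decl is the text between "<!" and ">", e.g. 'DOCTYPE html PUBLIC "..."'.
        if self.doctype is None and decl.lower().startswith("doctype"):
            self.doctype = decl


def find_doctype(html):
    """Returns the DOCTYPE declaration text, or None if the page has none."""
    parser = DoctypeFinder()
    parser.feed(html)
    parser.close()
    return parser.doctype


# Hypothetical pages, one with a DOCTYPE and one without.
with_doctype = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">'
    "<html><head><title>t</title></head><body></body></html>"
)
without_doctype = "<html><body><p>hello</p></body></html>"

declared = find_doctype(with_doctype)
missing = find_doctype(without_doctype)
```

Finding a DOCTYPE says nothing about validity; that still needs a real validator run against the declared DTD.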
Some of the most common causes for being unable to validate a page were,
Of the 973 valid pages, the following flavours of HTML were used.
Disappointingly, of the few valid pages in the sample, most are using a transitional DOCTYPE, which still allows all kinds of nasty presentational gunk. Interestingly, 17 pages were using valid XHTML 1.1. Although I don’t know for certain if these pages were served using an XML media type as is required, my suspicion would be that they were using text/html.
This means that only 2.8% of the entire sample could be validated successfully and only 0.4% validated with a strict DOCTYPE.
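The percentages use different denominators, which is easy to lose track of. A quick check of the arithmetic using only the counts quoted in this post (the `pct` helper is just for illustration):

```python
sample_size = 34_157    # all crawled pages
with_doctype = 15_313   # pages containing a DOCTYPE
validated = 14_125      # pages put through validation
valid = 973             # pages that passed


def pct(part, whole):
    """Percentage of whole, rounded to two decimal places."""
    return round(100 * part / whole, 2)


doctype_share = pct(with_doctype, sample_size)  # share of the whole sample
valid_of_checked = pct(valid, validated)        # share of validated pages
valid_of_sample = pct(valid, sample_size)       # share of the whole sample
```

So the 6.89% in the table is relative to the 14,125 validated pages, while the 2.8% figure is relative to all 34,157 pages.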
In the 5,966 invalid pages nearly 20% contained between 1 and 10 errors. At the other end of the scale, 12.8% of the invalid pages had over 300 errors and 7.5% had over 450 errors. The median number of validation errors was 60. The most errors found on a single page was 3,455. Ouch.
With such a high proportion of pages failing validation, it seems to me that a lot of pages contain a DOCTYPE without much (or any) consideration given to writing valid code. Perhaps this is the result of a lot of “copy and paste coding”, or the use of web authoring tools that automatically include a DOCTYPE in their output without enforcing validity.
A Walker, MJ Evans: “A Random Walk Web Crawler with Orthogonally Coupled Heuristics”, Proceedings of the Fourth International Network Conference (INC 2004), Plymouth, UK. (2004)
T Bray: “Measuring the Web”, 5th International World Wide Web Conference (May 1996) http://www.ra.ethz.ch/CDstore/www5/www134/overview.htm
Posted on 30th January 2006 in Web Standards.
It's masochistic to try to be a webdev professional and stick to HTML 2.0 or other rigid quality standards while "that" browser (both 6 & 7) is still out there making it inconsistent for everyone.
While page validity is nice… and I mean really nice, it's still not 100% practical. Sure, it's technically possible if you don't really want to finish many projects, but imho it is a standard for standards' sake.
Don't get me wrong, I believe in and respect the wonderful work of the W3C and WaSP, but there are certainly some wrinkles which perhaps make these kinds of stats premature.