Web Statistics

30th January 2006

Since Google just published their Web Authoring Statistics, I thought I’d look back at my findings from a few years ago, when I looked at markup validity and the usage of things like Flash and JavaScript in a small (very small compared to Google’s dataset) random sample of pages. The results, which utilise such complex analytical techniques as adding up numbers and drawing bar charts, are presented here.

My data was collected via a random web crawl which was carried out between 28th April 2003 and 29th July 2003. It consists of 34,157 pages.

Images

Of the sampled pages, 94% contained at least one img tag. One page (http://applyresults.about.com/bkey.htm) contained well over 2,000 occurrences of <img width="1" height="1">. Perhaps this is some kind of nasty layout hack? The page certainly didn’t seem to believe in CSS for layout. The median number of img tags on a page was 21, and nearly 30% of the pages contained between 1 and 10 img tags.

Distribution of img tags on all pages in the dataset.
Number of img tags    Percentage of pages
0                     6.27%
1 to 10               29.80%
11 to 20              13.04%
21 to 30              12.32%
31 to 40              7.37%
41 to 50              5.18%
51 to 60              3.91%
61 to 70              3.42%
71 to 80              2.70%
81 to 90              2.06%
91 to 100             1.91%
101 to 110            1.12%
111 to 120            2.18%
121 to 130            1.72%
131 to 140            1.49%
141 to 150            0.71%
151 to 160            0.74%
161 to 170            0.64%
171 to 180            0.47%
181 to 190            0.39%
191 to 200            0.63%
> 200                 2.53%
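
For anyone wanting to repeat this kind of count, here is a minimal sketch in Python. The original crawler's tooling isn't described here, so the names `ImgCounter`, `count_imgs` and `bucket` are hypothetical, and the standard library's `html.parser` is an assumption rather than what was actually used.

```python
from html.parser import HTMLParser


class ImgCounter(HTMLParser):
    """Counts <img> start tags in one HTML document (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag names, so <IMG> is counted too.
        if tag == "img":
            self.count += 1


def count_imgs(html):
    """Return the number of img tags found in an HTML string."""
    parser = ImgCounter()
    parser.feed(html)
    parser.close()
    return parser.count


def bucket(n, width=10, cap=200):
    """Map a count to a range label in the style of the table above."""
    if n == 0:
        return "0"
    if n > cap:
        return f"> {cap}"
    low = ((n - 1) // width) * width + 1
    return f"{low} to {low + width - 1}"
```

Feeding each page through `count_imgs` and tallying the `bucket` labels would give a distribution of this shape; `statistics.median` over the raw counts would give the median.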

Titles

Over 97% of pages in the sample used the title tag. Back in 1995/96 Tim “XML” Bray found that 87.5% of pages had a title. The apparent slight upturn in the usage of the title could be the result of a few things, for example:

Scripting and Flash

25,000 of the pages in the sample were examined to see if they made use of JavaScript, VBScript or Flash. Not surprisingly JavaScript was by far the most common — appearing on nearly 61% of the pages checked. Interestingly, all the pages that used VBScript were also using JavaScript and the sole use for VBScript on the pages in the sample appeared as part of a Flash detection script.

Use of JavaScript and VBScript.
Script type                    Number of Pages    Percentage of Pages
JavaScript                     15,214             60.86%
VBScript                       222                0.89%
Total Pages Using Scripting    15,214             60.86%
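
How the scripting language was identified isn't spelled out above, so the following is only one plausible approach: look at the `language` and `type` attributes of each script tag and treat anything that doesn't mention VBScript as JavaScript. `ScriptDetector` and `detect_scripts` are hypothetical names, and inline event handlers such as onclick are deliberately ignored.

```python
from html.parser import HTMLParser


class ScriptDetector(HTMLParser):
    """Collects the scripting languages declared on <script> tags."""

    def __init__(self):
        super().__init__()
        self.languages = set()

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        attr = dict(attrs)
        # Attribute values can be None (e.g. a bare "defer"), hence the "or".
        declared = f'{attr.get("language") or ""} {attr.get("type") or ""}'.lower()
        if "vbscript" in declared:
            self.languages.add("VBScript")
        else:
            # Assumption: an undeclared or unrecognised script is JavaScript,
            # which matches how browsers of the time treated it.
            self.languages.add("JavaScript")


def detect_scripts(html):
    """Return the set of scripting languages used on a page."""
    detector = ScriptDetector()
    detector.feed(html)
    detector.close()
    return detector.languages
```

A page that uses VBScript only inside a Flash detection routine would normally come back with both languages, which is the pattern reported in the table.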

With Flash, I searched for the usual components of the default Macromedia markup: the IE/Win-specific object tag (including classid) and the embed tag. About 1.9% of the pages checked were found to be using at least one of these tags. Not surprisingly, the vast majority of the pages using Flash contained both tags. All but one of these pages had an embed tag; the only page not to use embed (hence providing the Flash content only to Internet Explorer) was found on MSN.

Use of Flash — object and embed tags.
Tag                        Number of Pages    Percentage of Pages
Flash Object Tag           422                1.69%
Flash Embed Tag            476                1.90%
Total Pages Using Flash    477                1.91%
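
A check along these lines can be sketched in Python. The classid and MIME type below are the ones the default Macromedia markup used; everything else (the `FlashDetector` and `detect_flash` names, and treating any embed whose src ends in .swf as Flash) is an assumption made for illustration, not a description of the original script.

```python
from html.parser import HTMLParser

# Values from the standard Macromedia Flash markup, compared case-insensitively.
FLASH_CLASSID = "clsid:d27cdb6e-ae6d-11cf-96b8-444553540000"
FLASH_MIME = "application/x-shockwave-flash"


class FlashDetector(HTMLParser):
    """Records whether a page has a Flash object tag and/or a Flash embed tag."""

    def __init__(self):
        super().__init__()
        self.has_object = False
        self.has_embed = False

    def handle_starttag(self, tag, attrs):
        attr = {name: (value or "") for name, value in attrs}
        if tag == "object":
            if attr.get("classid", "").lower() == FLASH_CLASSID:
                self.has_object = True
        elif tag == "embed":
            is_flash_type = attr.get("type", "").lower() == FLASH_MIME
            # Assumption: a .swf source also counts, ignoring any query string.
            is_swf = attr.get("src", "").lower().split("?")[0].endswith(".swf")
            if is_flash_type or is_swf:
                self.has_embed = True


def detect_flash(html):
    """Return (has_object, has_embed) for an HTML string."""
    detector = FlashDetector()
    detector.feed(html)
    detector.close()
    return detector.has_object, detector.has_embed
```

If the 477 total is the union of the two rows, as "at least one of these tags" implies, then 422 + 476 - 477 = 421 pages carried both tags, one carried only object and 55 carried only embed.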

HTML Validation

Of the pages in my sample an encouraging 15,313 contained a DOCTYPE (approximately 45%). However, do these pages follow the standard as they claim? I was able to validate 14,125 of these documents and the following results were obtained.

Proportion of pages found to validate against the HTML standard they claim to follow.
Result                       Number of Pages    Percentage of Pages
Valid                        973                6.89%
Invalid                      5,966              42.24%
Errors Prevent Validation    7,186              50.87%
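
Before a page can be validated, its declared DOCTYPE has to be found. The sketch below shows one way to pull out the public identifier with Python's `html.parser`; `DoctypeFinder` and `public_identifier` are hypothetical names, and this is not necessarily how the original sample was processed. The validation itself would still need a real SGML/XML validator.

```python
import re
from html.parser import HTMLParser


class DoctypeFinder(HTMLParser):
    """Remembers the first <!DOCTYPE ...> declaration in a document."""

    def __init__(self):
        super().__init__()
        self.doctype = None

    def handle_decl(self, decl):
        # decl is the text between "<!" and ">", e.g. 'DOCTYPE html PUBLIC "..."'.
        if self.doctype is None and decl.lower().startswith("doctype"):
            self.doctype = decl


def public_identifier(html):
    """Return the DOCTYPE's public identifier, "" if it has none,
    or None when the page has no DOCTYPE at all."""
    finder = DoctypeFinder()
    finder.feed(html)
    finder.close()
    if finder.doctype is None:
        return None
    match = re.search(r'PUBLIC\s+"([^"]+)"', finder.doctype, re.IGNORECASE)
    return match.group(1) if match else ""
```

The identifier (for example "-//W3C//DTD XHTML 1.0 Strict//EN") tells the validator which DTD the page claims to follow.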

Some of the most common causes for being unable to validate a page were:

  1. Missing Character Encoding (6,074 pages)
  2. Invalid Document Type Definition (259 pages)
  3. Incorrect Character Encoding (43 pages)

Of the 973 valid pages, the following flavours of HTML were used.

DOCTYPEs used on valid pages.
DOCTYPE                   Number of Pages    Percentage of Valid Pages
XHTML 1.0 Transitional    390                40.08%
HTML 4.01 Transitional    379                38.95%
XHTML 1.0 Strict          107                11.00%
HTML 4.0 Transitional     48                 4.93%
HTML 4.01 Strict          16                 1.64%
XHTML 1.1                 15                 1.54%
HTML 3.2                  14                 1.44%
XHTML 1.1 Strict          2                  0.21%
HTML 4.0 Frameset         1                  0.10%
HTML 2.0                  1                  0.10%

Disappointingly, of the few valid pages in the sample, most use a transitional DOCTYPE, which still allows all kinds of nasty presentational gunk. Interestingly, 17 pages were using valid XHTML 1.1. Although I don’t know for certain whether these pages were served using an XML media type, as is required, my suspicion is that they were using text/html.

This means that only 2.8% of the entire sample could be validated successfully and only 0.4% validated with a strict DOCTYPE.
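
Both figures can be reproduced from the table of valid DOCTYPEs and the sample size of 34,157. The snippet below assumes that the XHTML 1.1 rows count as strict, since XHTML 1.1 has no transitional variant; leaving them out gives 123 pages instead of 140, which still rounds to 0.4%.

```python
SAMPLE_SIZE = 34_157  # pages in the whole crawl

# Valid pages per DOCTYPE, copied from the table above.
valid_by_doctype = {
    "XHTML 1.0 Transitional": 390,
    "HTML 4.01 Transitional": 379,
    "XHTML 1.0 Strict": 107,
    "HTML 4.0 Transitional": 48,
    "HTML 4.01 Strict": 16,
    "XHTML 1.1": 15,
    "HTML 3.2": 14,
    "XHTML 1.1 Strict": 2,
    "HTML 4.0 Frameset": 1,
    "HTML 2.0": 1,
}

# Assumption: which rows are treated as "strict".
STRICT = {"XHTML 1.0 Strict", "HTML 4.01 Strict", "XHTML 1.1", "XHTML 1.1 Strict"}


def percent(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


valid_total = sum(valid_by_doctype.values())
strict_total = sum(n for name, n in valid_by_doctype.items() if name in STRICT)

print(valid_total, percent(valid_total, SAMPLE_SIZE))    # 973 2.8
print(strict_total, percent(strict_total, SAMPLE_SIZE))  # 140 0.4
```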

Of the 5,966 invalid pages, nearly 20% contained between 1 and 10 errors. At the other end of the scale, 12.8% of the invalid pages had over 300 errors and 7.5% had over 450 errors. The median number of validation errors was 60. The most errors found on a single page was 3,455. Ouch.

Distribution of the number of validation errors.
Number of Validation Errors    Percentage of invalid pages
1 to 10                        19.85%
11 to 20                       9.07%
21 to 30                       6.44%
31 to 40                       6.00%
41 to 50                       4.06%
51 to 60                       5.08%
61 to 70                       4.14%
71 to 80                       3.50%
81 to 90                       2.45%
91 to 100                      1.84%
101 to 110                     1.66%
111 to 120                     1.49%
121 to 130                     1.07%
131 to 140                     1.21%
141 to 150                     2.26%
151 to 160                     1.48%
161 to 170                     1.63%
171 to 180                     1.16%
181 to 190                     1.69%
191 to 200                     2.15%
201 to 210                     1.88%
211 to 220                     0.80%
221 to 230                     1.14%
231 to 240                     0.62%
241 to 250                     0.87%
251 to 260                     0.75%
261 to 270                     0.79%
271 to 280                     0.94%
281 to 290                     0.57%
291 to 300                     0.62%
> 300                          12.81%

Such a high proportion of pages failing validation suggests to me that a lot of pages include a DOCTYPE without much (or any) consideration given to writing valid code. Perhaps this is the result of a lot of “copy and paste coding”, or of web authoring tools that automatically include a DOCTYPE in their output without enforcing validity.

A Walker, MJ Evans: “A Random Walk Web Crawler with Orthogonally Coupled Heuristics”, Proceedings of the Fourth International Network Conference (INC 2004), Plymouth, UK. (2004)

T Bray: “Measuring the Web”, 5th International World Wide Web Conference (May 1996) http://www.ra.ethz.ch/CDstore/www5/www134/overview.htm

Posted on 30th January 2006 in Web Standards.

Comments

  1. It's masochistic to try to be a webdev professional and stick to html 2.0 or other rigid quality standards while "that" browser (both 6 & 7) is still out there making it inconsistent for everyone.

    While page validity is nice… and I mean really nice, it's still not 100% practical. Sure, it's technically possible if you don't really want to finish many projects, but imho it is a standard for standards' sake.

    Don't get me wrong, I believe in and respect the wonderful work of the W3C and WaSP, but there are certainly some wrinkles which perhaps make these kinds of stats premature.

    # Posted by Elena on 10th December 2006.
