Earlier today I wondered how often the various forms of paragraph tags were used in HTML.
The three forms are:
A: <p>...</p>
B: ...<p>...
C: ...<p/>...
The consensus on FriendFeed was that the first form was clearly the way to go. But I was left wondering if, because of the legacy of early HTML, the second form might still be used with some frequency.
So this afternoon I wrote a quick MapReduce job that examined a random sample of the web and counted paragraph tags.
The parser maintained a simple stack as it traversed the document. The count for form A, <p>...</p>, is the number of balanced pairs at the end of the document, including empty pairs. The count for form B, <p>, is the number of opening p tags that have no closing tag. And the count for form C, <p/>, is the number of self-closing tags.
The parser didn’t check for semantic validity — <p> tags could appear anywhere and still be counted — but I reasoned the number of misplaced tags would appear roughly equally in all three forms, so it was okay to ignore semantic errors for this casual experiment.
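A minimal Python sketch of the stack-based counting described above. This is not the code that actually ran; the regular expression, the function name count_p_forms, and the choice to ignore stray closing tags are all assumptions made for illustration.

```python
import re

# Matches <p>, <p attr="...">, <p/>, <p />, and </p>, but not <pre> or <param>.
# Group 1 is "/" for a closing tag; group 2 is "/" for a self-closing tag.
TAG_RE = re.compile(r"<(/?)p(?:\s[^>]*?)?(/?)>", re.IGNORECASE)

def count_p_forms(html):
    """Count balanced pairs (A), unclosed opening tags (B), and self-closing tags (C)."""
    open_stack = []      # positions of opening <p> tags not yet closed
    balanced = 0
    self_closing = 0
    for match in TAG_RE.finditer(html):
        if match.group(1) == "/":
            # A closing tag pairs with the most recent open tag, if there is one.
            # Assumption: a stray </p> with nothing open is simply ignored.
            if open_stack:
                open_stack.pop()
                balanced += 1
        elif match.group(2) == "/":
            self_closing += 1
        else:
            open_stack.append(match.start())
    # Whatever is still on the stack never found a closing tag.
    return {"A": balanced, "B": len(open_stack), "C": self_closing}

counts = count_p_forms("<p>one</p><p>two<p>three</p><p/><pre>x</pre>")
print(counts)  # {'A': 2, 'B': 1, 'C': 1}
```

Like the parser described above, the sketch makes no attempt to check where a tag appears in the document.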
In the 833,866 HTML documents considered, form A (balanced pairs) was found 7,325,544 times, form B (unbalanced opening tags) was found 1,129,180 times, and form C (self-closing tags) was found 44,180 times.
So the container <p>...</p> is used about 6.5x as much as the separator <p>, and both are used far more than the self-closing <p/>.
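For anyone who wants to check the arithmetic, here is a short Python sketch that derives the ratios from the three totals reported above. The variable names are mine; only the counts come from the experiment.

```python
# Totals reported for the 833,866 sampled documents.
form_a = 7_325_544  # balanced <p>...</p> pairs
form_b = 1_129_180  # opening <p> tags with no closing tag
form_c = 44_180     # self-closing <p/> tags

total = form_a + form_b + form_c
a_to_b = form_a / form_b
b_to_c = form_b / form_c

print(f"A:B ratio {a_to_b:.2f}")  # A:B ratio 6.49
print(f"B:C ratio {b_to_c:.2f}")  # B:C ratio 25.56
for name, count in (("A", form_a), ("B", form_b), ("C", form_c)):
    print(f"form {name}: {100 * count / total:.1f}% of all paragraph tags counted")
```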
We also learn that 50.1% of the documents sampled contain only form A, 4.41% contain only form B, and a mere 0.21% contain only form C.
Further, 11.0% contain both form A and B, 0.70% contain both A and C, and 0.10% contain both B and C. And 0.15% of documents contain all three forms.
Interestingly, a full 33.29% of the documents sampled contained no <p> tags of any form at all. Presumably those documents use <br> and/or <div> tags to structure text into blocks, but I need to dig into this further.
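The per-document breakdown can be sketched the same way: label each document by which forms it contains, then tally the labels. This is a hedged illustration in Python, assuming each document has already been reduced to a dictionary of per-form counts; the function names and the toy sample are hypothetical and are not the real data.

```python
from collections import Counter

def forms_present(counts):
    """Return a label such as 'A', 'A+B', or 'none' for one document."""
    present = [form for form in ("A", "B", "C") if counts.get(form, 0) > 0]
    return "+".join(present) if present else "none"

def tally(docs):
    """Percentage of documents falling into each combination of forms."""
    labels = Counter(forms_present(counts) for counts in docs)
    total = sum(labels.values())
    return {label: 100.0 * n / total for label, n in labels.items()}

# Toy sample of four documents (made up for illustration only).
sample = [
    {"A": 3, "B": 0, "C": 0},
    {"A": 1, "B": 2, "C": 0},
    {"A": 0, "B": 0, "C": 0},
    {"A": 5, "B": 0, "C": 0},
]
shares = tally(sample)
print(shares)  # {'A': 50.0, 'A+B': 25.0, 'none': 25.0}
```

Because every document lands in exactly one of the eight possible combinations, the percentages should sum to 100, which makes a handy sanity check on the reported figures.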
The lesson for me is that, in keeping with the earliest HTML specifications, <p> is still occasionally used as a separator in some documents. But because documents containing both forms A and B (11.0%) are more common than documents containing form B alone (4.41%), it is more likely than not that an unbalanced paragraph tag is that way by accident rather than by intent.
So there you have it. More than you ever wanted to know about the simple little <p> tag.