Recently I had to write an HTML parser for a project I've been working on for some time now. First I tried translating an open source C++ parser, but it really wasn't what I wanted, and it was also under the GPL. After contacting the author and realizing (or re-remembering) that I could not use a GPL derivative in a commercial library or application, I scrapped that and went back to the source: the official HTML DTD.
Re-remembering how to read a DTD after not having done so for so long was a chore, but I found a very helpful tutorial from the folks at Autistic Cuckoo. I spent the next day or two writing the code in the file linked below. I took some inspiration from a few files I found while browsing the Firefox code under the Mozilla license. The rest of it came from studying the DTD and trying to figure out a way to encapsulate that in a usable object model.
Here's an example of how to use it:
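    // url and html are assumed to hold the page address and its markup.
    HtmlDocument doc = new HtmlDocument(url, html);
    StringBuilder sb = new StringBuilder();

    // Collect every A (anchor) tag found in the document.
    Collection<HtmlTag> pcdata = doc.GetList(DtdElement.A);
    foreach (HtmlTag tag in pcdata)
    {
        // Skip closing tags; only opening tags are reported.
        if (!tag.EndTag)
        {
            Dictionary<string, string> attributes = doc.GetAttributes(tag);
            sb.AppendLine("");
            sb.AppendLine("A: " + doc.ReadSlice(tag.Slice));

            // List each attribute as key=value under its tag.
            foreach (KeyValuePair<string, string> pair in attributes)
            {
                sb.AppendLine(" " + pair.Key + "=" + pair.Value);
            }
        }
    }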
I'm releasing it under the BSD license, which I like much more than the GPL as I'm not really a "true" free software zealot. The only thing I ask is that if you fix a bug or make an improvement, please share it with me and I'll put up a new version here.
NetBrick.Net.OpenUtils1.zip (35.31 KB)