Wednesday, June 24, 2009

HTML parsing

In a very small personal project I needed to parse an HTML page. I assumed that HTML was basically XML, so I tried to load the response string into an XDocument, but I was too optimistic. I got a lot of errors from the XDocument.Parse() method. I read on the internet that this only works if the web page is XHTML and well-formed. But there are a lot of web pages out there that have never heard of standards. Pages from Microsoft are a good example. This page of the Sysinternals Suite, for example, produces 53 errors and 14 warnings when validated with the W3C validator!
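To make the problem concrete, here is a minimal sketch of the kind of failure I mean. The HTML string is a made-up example (not the Sysinternals page): it has unclosed tags and an unquoted attribute value, which browsers accept but an XML parser rejects.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

class Program
{
    static void Main()
    {
        // Made-up "tag soup": <br> and <img> are never closed,
        // and the src attribute value is not quoted.
        string html = "<html><body><p>Hello<br>world<img src=logo.png></p></body></html>";

        try
        {
            // XDocument.Parse expects well-formed XML and throws otherwise.
            XDocument doc = XDocument.Parse(html);
            Console.WriteLine("Parsed, root element: " + doc.Root.Name);
        }
        catch (XmlException ex)
        {
            // For input like the above, this branch is taken.
            Console.WriteLine("Not well-formed XML: " + ex.Message);
        }
    }
}
```

Note that XDocument.Parse() stops at the first problem it finds, so on a real page you fix one error only to run into the next.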

There is a solution to the HTML parsing problem for C#. A project called HTML Agility Pack on CodePlex provides a mechanism to parse an HTML string and navigate it like an XmlDocument. It works great, but it does not support LINQ queries, which is annoying if you are used to them (like me :-))
But there is a solution for this as well. Since the source code of the HTML Agility Pack is open to everyone, Vijay Santhanam created the ToXDocument() method, which converts the HtmlDocument to an XDocument. The result can then be queried with LINQ to XML. The post about this and a link to his project can be found here.
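The two steps described above can be sketched roughly as follows. This is not Vijay Santhanam's ToXDocument() implementation (I am not reproducing his code here). The helper name ToXDocumentSketch is hypothetical, and it takes a simple route that I am assuming is good enough for small pages: let the HTML Agility Pack write the document out as XML, then parse that text with XDocument. It needs a reference to the HtmlAgilityPack assembly.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Xml.Linq;
using HtmlAgilityPack;

static class HtmlToXml
{
    // Hypothetical helper for illustration only, not the ToXDocument() from the linked project.
    public static XDocument ToXDocumentSketch(string html)
    {
        var doc = new HtmlDocument();
        doc.OptionOutputAsXml = true;   // ask the Agility Pack to emit XML when saving
        doc.LoadHtml(html);             // tolerant HTML parsing, no exception for tag soup

        using (var writer = new StringWriter())
        {
            doc.Save(writer);
            // May still throw XmlException if the emitted text is not valid XML for some page.
            return XDocument.Parse(writer.ToString());
        }
    }
}

class Program
{
    static void Main()
    {
        // Made-up input with unclosed <p> and <br> tags.
        string html = "<html><body><p>See <a href=\"http://example.com\">this</a><br>"
                    + "<a href=\"/other\">that</a></body></html>";

        // 1) Navigate with the Agility Pack directly, XmlDocument style (XPath).
        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);
        var nodes = htmlDoc.DocumentNode.SelectNodes("//a[@href]");
        if (nodes != null)  // SelectNodes returns null when nothing matches
        {
            foreach (HtmlNode a in nodes)
                Console.WriteLine(a.GetAttributeValue("href", ""));
        }

        // 2) Convert and query with LINQ to XML.
        XDocument xdoc = HtmlToXml.ToXDocumentSketch(html);
        var links = xdoc.Descendants("a")
                        .Select(a => (string)a.Attribute("href"))
                        .Where(href => href != null);
        foreach (string href in links)
            Console.WriteLine(href);
    }
}
```

The round trip through a string is wasteful for big pages, and I would not be surprised if some pages still produce text that XDocument.Parse() rejects, so treat this as a starting point rather than a replacement for the linked project.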

It would be nice if direct LINQ support were added to the HtmlDocument class itself, but maybe that will happen in the future...
