
Creating a generic Site-To-RSS tool

Download the source files.

 

Index:

Creating a generic Site-To-RSS tool
What you’ll need
Summary
Introduction
Planning a site scrape
Regex – a powerful scraping tool
Creating our scraping regular expression
Getting our link
Getting our Title
Getting our description
Getting our category
Getting our publishing date
Our final regex
Making an RSS feed out of it
Validating our feed
Subscribing to our feed
Approaches for a generic tool
Building the generic SiteToRSS Class
Verifying the existence of capture groups in a pattern
Retrieving the site using the WebClient class
Writing the RSS feed to either a file or an in-memory stream
Working with a MemoryStream and the case for XML encoding
Link prefix
Using the generic class with .NetWire
What’s in the download?

 

What you’ll need

·         Regular expression knowledge. Consider reading the following articles:

o        Introduction to Regular Expressions

o        Practical Parsing Using Groups

·         Expresso – a tool for working with regular expressions

 

Summary

I’ll show how to use regular expressions to parse a web page’s HTML text into manageable chunks of data. That data will be converted and written out as an RSS feed for the whole world to consume. Finally, I’ll show how to create a generic tool that automatically generates an RSS feed from any given website, given a small set of parameters. At the end of the day we will have a working RSS feed for www.DotNetWire.com.

 

Introduction

Ah, the joys of RSS. You can get the data you need, as soon as it’s available, and no nagging browsers or popups along the way. If only all sites had RSS feeds, huh? If there’s one thing that would be really nice, it would be the ability to generate an RSS feed from any site I want. For example, .NetWire is a very interesting site with lots of useful information. However, the folks maintaining the site haven’t provided it with an RSS feed, which it so sorely needs.

So I got to thinking: “Hmm, all the data on the site that’s important to me seems to be arranged in an orderly and predictable manner. I should be able to parse it fairly easily and turn it into an RSS feed.” So I started trying. It worked out pretty well. So well, in fact, that I’ve come up with a way to let you do your own site scraping using a generic tool, providing it with nothing more than simple rules expressed as a single regular expression.

Planning a site scrape

“Site scraping” means going over a site’s HTML and “mining” it for the relevant data, discarding all other text. This is what I intend to show here. For this article, I’ve chosen .NetWire as the site I’ll be scraping, as the outcome will be useful to a great many people. In planning the scrape I’ll ignore the specifics of how I actually get the text to parse, and leave that topic for the end of the article.

The first thing I did was to open my web browser on the .NetWire site, right-click and select “View Source”. Notepad shows me the site as my future parser will see it. This raw text is the juice I’ll need to parse in order to get the data I need.

To be honest, it looked quite scary. How on earth am I going to come up with an easy way to parse such an enormous amount of information without losing my head? Scrolling through the text, however, I could start to see patterns in which the “important” text, the text that was relevant to me, showed up.

There were links inside paragraphs, followed by SPANs and many more attributes. It was a nightmare to parse. Just writing all the rules for finding a specific link or title for the RSS feed I wanted to create was hard enough, but I also had lots more to contend with. I had to find text inside found text inside found text. It was hardly a job for a few hours on the weekend.

So the next thing I decided to check was whether I could do the job with regular expressions.

 

Note: If you don’t care how we build the regular expression for scraping the site and would rather just move on to where we actually use it to create the RSS feed, feel free to jump directly to the “Making an RSS feed out of it” section.

 

Regex – a powerful scraping tool

If you don’t know what regular expressions are, there are loads of articles on the subject. I’ve written a couple myself. They are referenced at the bottom of this article. You’ll need to understand regular expressions before reading how to use them for scraping a site.

Regular expressions let us easily extract the information we need from text. They allow us, through expressions provided as plain text, to retrieve strings that match any number of rules we define. The data we get back after running an expression on a string can be as complex and as detailed as we’d like. We can even divide it into named groups of matched text, which lets us program against the Regex interface very easily (see “Practical Parsing Using Groups” for more info).
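
As a quick illustration of what a named group looks like in code, here is a minimal, stand-alone sketch (not part of the tool itself):

'needs: Imports System.Text.RegularExpressions
Dim m As Match = Regex.Match("Hello, World!", "Hello,\s*(?<word>\w+)")
If m.Success Then
    'prints "World" - the text captured by the group named "word"
    Console.WriteLine(m.Groups("word").Value)
End If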

 

Since a site is ultimately represented as plain text (be it HTML, JavaScript, or anything else), we can apply regular expressions to that text as well, allowing us to search it and filter out any irrelevant information quickly and easily.

 

Creating our scraping regular expression

For our RSS feed, we need only a few pieces of data retrieved from the HTML for every “post” we intend to create in our RSS feed:

·         Link: The link the reader can click to get to the original item. In .NetWire it’s the link of the news item

·         Title: The title that will appear in the user’s RSS reader. In .NetWire it’s the title of the news item

·         Description: The actual text of an individual post. In .NetWire it’s the text of the news item

·         Category: The kind of item, such as “Article” or “Product Release”. In .NetWire it’s the short label that precedes the date

·         Publishing date: The date of the post. In .NetWire it’s the publish date of the news item.

 

These various items are buried deep inside the HTML of our website. It is now our job to find a regular expression that retrieves those items and allows us to easily reference them from code. Using our knowledge of “groups” in Regex, we want a group in the resulting regex for every item we want to retrieve. We’ll name them “link”, “title”, “description”, “category” and “pubDate” respectively.

To develop the regex, I used Expresso, a tool designed to help with building and testing regular expressions.

 

Throughout, I’ll rely on this piece of HTML, taken from .NetWire’s page source:

 

 

<p class="clsNormalText"><a href="/redirect.asp?newsid=4974" target="newwindow"
class="clsNewsHead">Globalizing and Localizing Windows Applications, Part 1</a><br>
With the explosive growth of the Internet and rapid globalization of the world's
economies, the earth is getting smaller and smaller. The applications that you develop for
a local market may soon be used in another country. If the world used a common language,
that would make the life of developers much easier. However, reality is far from perfect.
The author shows you how to make your applications ready for the global marketplace.<br>
<span class="clsSubText">Article. Sep 16, 2003.</span></p>

 

This HTML represents one news item on .NetWire, and it is the text we’ll need to focus on.

Our first order of business is getting the link of the news item. Why the link first? Because it’s the first item in order of appearance, which makes it the least complicated to find.

 

Getting our link

Looking at the piece of text that we want to extract the link from:

 

<p class="clsNormalText"><a href="/redirect.asp?newsid=4974" target="newwindow"
class="clsNewsHead">Globalizing and Localizing Windows Applications, Part 1</a><br>

With the explosive…

 

We can easily see that each link (and title) is encapsulated between two items:

<p class="clsNormalText"><a href="

→ our link

" target="newwindow" class="clsNewsHead">
 

Simply enough, the following regular expression catches all instances of such a link within our HTML file, and presents us with a group name “link” that gives us the actual redirection string of the link:

 

<p\s*class="clsNormalText"><a\shref="(?<link>.*)?("\s*target="newwindow")

 

I’ve used “\s” so that I don’t have to declare exactly how many spaces or tabs reside between the tag definitions and the actual tag attributes. Also notice the “?” I’ve added just before the "\s*target="newwindow" part of the pattern. This is done so the expression will catch the first instance of this occurrence rather than the last one (otherwise it would match everything up to the last link near the end of the file instead of closing the match on the first one).
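
To convince ourselves the expression works, we can run it against the sample HTML shown earlier. Here’s a small test sketch; sampleHtml is a hypothetical variable assumed to hold that HTML snippet:

'needs: Imports System.Text.RegularExpressions
Dim linkPattern As String = _
    "<p\s*class=""clsNormalText""><a\shref=""(?<link>.*)?(""\s*target=""newwindow"")"
Dim m As Match = Regex.Match(sampleHtml, linkPattern)
If m.Success Then
    'should print: /redirect.asp?newsid=4974
    Console.WriteLine(m.Groups("link").Value)
End If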

 

Getting our Title

Now that we have the link, we need to get the title for the link. This one is also relatively easy. The title resides between the closing “>” of the anchor’s opening tag and the anchor’s closing tag (“</a>”). We also need to account for new lines and spaces along the way, so we take these into the regular expression as well.

Here’s the full expression so far. I’ve highlighted the new part:

 

<p\s*class="clsNormalText"><a\shref="(?<link>.*)?("\s*target="newwindow")(.|\n)*?>(?<title>.*\n?.*)?(</a><br>\s*\n*)

 

And we have a group in there as well, called “title” so we can refer to it later in code. Notice that the title is made up of any number of characters, followed by zero or more new lines and more characters.

 

 

Getting our description

The description is a block of text that can contain new lines, and is terminated by a “<br>”:

 

<p\s*class="clsNormalText"><a\shref="(?<link>.*)?("\s*target="newwindow")(.|\n)*?>(?<title>.*\n?.*)?(</a><br>\s*\n*)(?<description>(.|\n)*?)(<br>(.|\n)*?>)

 

The end of the expression already contains the beginning of the next piece of text we want to find.

 

Getting our category

The category of the current news item is usually “Article” or “Product Release”. It always starts after a “>” sign and ends with a period (“.”):

 

<p\s*class="clsNormalText"><a\shref="(?<link>.*)?("\s*target="newwindow")(.|\n)*?>(?<title>.*\n?.*)?(</a><br>\s*\n*)(?<description>(.|\n)*?)(<br>(.|\n)*?>)(?<category>.*)?\.

 

Getting our publishing date

The news date follows right after the category’s ending period (with zero or more spaces between them) and finishes with another period, followed by the closing SPAN and P tags.

 

<p\s*class="clsNormalText"><a\shref="(?<link>.*)?("\s*target="newwindow")(.|\n)*?>(?<title>.*\n?.*)?(</a><br>\s*\n*)(?<description>(.|\n)*?)(<br>(.|\n)*?>)(?<category>.*)?\.\s*(?<pubDate>.*)?(\.</span></p>)

 

Our final regex

So we end up with an expression we can use to scan .NetWire’s HTML and retrieve a list of Matches, each of which contains groups named “link”, “title” and so on that we can use in our code. Our next step is to transform this pile of data into useful, readable information.
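
Plugged into code, the final expression hands us a MatchCollection we can simply walk through. The following sketch assumes REGEX_PATTERN is a constant holding the full expression we just built and downloadedHtml holds the page source:

'needs: Imports System.Text.RegularExpressions
For Each aMatch As Match In Regex.Matches(downloadedHtml, REGEX_PATTERN)
    Console.WriteLine("link:        " & aMatch.Groups("link").Value)
    Console.WriteLine("title:       " & aMatch.Groups("title").Value)
    Console.WriteLine("description: " & aMatch.Groups("description").Value)
    Console.WriteLine("category:    " & aMatch.Groups("category").Value)
    Console.WriteLine("pubDate:     " & aMatch.Groups("pubDate").Value)
Next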

 

Making an RSS feed out of it

 

The first step in creating a valid RSS feed is to know what the RSS schema looks like. There are several RSS standards out there today. I’ve chosen to implement this using the RSS 2.0 standard. I won’t bore you with the entire schema definition here, but a standard RSS feed using the RSS 2.0 schema should look something like this:

 

<?xml version="1.0" encoding="utf-8" ?>
<rss version="2.0" xmlns:blogChannel="http://backend.userland.com/blogChannelModule">
  <channel>
    <title>...</title>
    <link>...</link>
    <description>...</description>
    <copyright>...</copyright>
    <generator>...</generator>
    <item>
      <title>...</title>
      <link>...</link>
      <description>...</description>
      <category>...</category>
      <pubDate>...</pubDate>
    </item>
  </channel>
</rss>

 

 

The easiest way to write XML with the .NET Framework is to use the XmlTextWriter class. This class abstracts away the need to explicitly write strings that represent XML, and supports writing directly to a file or to an IO.Stream object. That stream can be a file stream, a memory stream, a response stream or anything else that derives from System.IO.Stream. Pretty powerful.

Here’s a small method that gets all the matches from a site’s HTML, loops through them, and uses an XmlTextWriter to write the XML representing the RSS feed:

 

Public Sub WriteRSSToStream(ByVal txWriter As TextWriter)

 

'our pattern to parse the page

Const REGEX_PATTERN As String = "<p\s*class=""clsNormalText""><a\shref=""(?<link>.*)?(""\s*target=""newwindow"")(.|\n)*?>(?<title>.*\n?.*)?(</a><br>\s*\n*)(?<description>(.|\n)*?)(<br>(.|\n)*?>)(?<category>.*)?\.\s*(?<pubDate>.*)?(\.</span>)"

 

'Get the HTML to parse

Dim DownloadedHtml As String = GetHtml()

'Get the matches using our regular expression

Dim found As MatchCollection = Regex.Matches(DownloadedHtml, REGEX_PATTERN)

Dim writer As New XmlTextWriter(txWriter)

 

With writer

    'make the resulting xml human readable

    .Formatting = Formatting.Indented

 

    'write the document header declaring rss version

    'and channel info

    .WriteStartDocument()

    .WriteComment("RSS generated by SiteToRSS generator at " _

+ DateTime.Now.ToString("r"))

    .WriteStartElement("rss")

    .WriteAttributeString("version", "2.0")

    .WriteAttributeString("xmlns:blogChannel", _

"http://backend.userland.com/blogChannelModule")

 

    .WriteStartElement("channel", "")

    .WriteElementString("title", RSSFeedName)

    .WriteElementString("link", RssFeedLink)

    .WriteElementString("description", RssFeedDescription)

    .WriteElementString("copyright", RssFeedCopyright)

    .WriteElementString("generator", "SiteParser RSS engine 1.0 by Roy Osherove")

 

    'write out the individual posts

    For Each aMatch As Match In found

        Dim link As String = aMatch.Groups("link").Value

        Dim title As String = aMatch.Groups("title").Value

        Dim description As String = aMatch.Groups("description").Value

 

      'format the date as RFC1123 date string (“Tue, 10 Dec 2002 22:11:29 GMT”)

        Dim pubDate As String = _

DateTime.Parse(aMatch.Groups("pubDate").Value).ToString("r")

        Dim subject As String = aMatch.Groups("category").Value

 

        .WriteStartElement("item")

 

        .WriteElementString("title", title)

        .WriteElementString("link", link)

 

      'The description may contain illegal chars

      'so write it out as CDATA

        .WriteStartElement("description")

        .WriteCData(description)

        .WriteEndElement()

     

        .WriteElementString("category", subject)

        .WriteElementString("pubDate", pubDate)

 

        .WriteEndElement()

 

     Next

 

     'close all open tags and finish up

     .WriteEndDocument()

     .Flush()

     .Close()

End With

 

End Sub

 

The code to generate an RSS feed is surprisingly simple. Notice that the method accepts a TextWriter, which can potentially write to a file, to an in-memory string or to lots of other targets; we are not bound to any particular target in this implementation. I still haven’t shown how to get the actual HTML from the web, but I’ll explain that shortly.
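
For example, the very same method can write the feed to a file on disk or build it as an in-memory string, depending only on the TextWriter we hand it. A quick sketch, assuming the WriteRSSToStream method above is in scope (the file name is made up, and the StringWriter route has an encoding caveat discussed later in the article):

'needs: Imports System.IO and Imports System.Text
'write the feed straight to a file...
Dim fileWriter As New StreamWriter("dotnetwire.xml", False, Encoding.UTF8)
WriteRSSToStream(fileWriter)
fileWriter.Close()

'...or build it as an in-memory string
Dim stringWriter As New StringWriter
WriteRSSToStream(stringWriter)
Dim rssXml As String = stringWriter.ToString()
stringWriter.Close()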

 

Validating our feed

To validate the feed as proper RSS XML, you can use one of the various free RSS validation sites out there (www.FeedValidator.org comes to mind). The site will make sure your feed lives up to the standard it claims to support, and will tell you if you missed anything important.

It’s very helpful to test against such a site to make sure you don’t screw up people’s aggregators that will subscribe to your new feed.

 

Subscribing to our feed

Now that we have a ready-made XML file, we can test it using a real aggregator. I used SharpReader and simply subscribed to a feed located at the path of the XML file. In SharpReader, I made sure that there are exactly as many posts as there are news items on the site, and that the titles are correct. I also made sure that the “subject” column correctly represents the “category” of each news item.

 

Approaches for a generic tool

Now that we have the basic mechanics working, we need to appreciate the power that comes from such a simple technique. What we’ve seen here demonstrates that, given a simple regular expression and some text to parse, we can scrape basically any site we want.

The natural next step is to build a simple class that receives these parameters and outputs an RSS feed accordingly.

Such a class can later be used to build a much more generic web site or web service, to which sites and expressions can be added dynamically, and that returns valid RSS feeds given a site ID.

But let’s start small.

 

Building the generic SiteToRSS Class

Our class should have several public properties representing the various RSS feed properties (description, generator and so on).

It should also be able to download a site from the web, and write an RSS feed into a file or just return it as a string.

I’ll spare you the entire code of the class, but I’ll refer here to the less trivial methods inside it. Here’s the basic layout:

 

 

 

Public Class RSSCreator

    Public Sub New(ByVal Url As String, ByVal FileName As String)

    End Sub

 

    Public Sub New(ByVal Url As String)

    End Sub

 

    Public Property UrlToParse() As String

    End Property

 

    ''' <summary>

    '''     the file to which the RSS feed will be written

    ''' </summary>

    Public Property FileName() As String

    End Property

 

    ''' <summary>

    '''     returns a string containing the RSS feed xml

    ''' </summary>

    Public Overloads Function GetRss() As String

        Dim ms As New MemoryStream

        Dim sr As New StreamWriter(ms, Encoding.UTF8)

 

        'We send "false" to signal the method to not close the stream automatically in the end

        'we need to close the stream manually so we can get its length

        WriteRSS(sr, False)

        Try

 

            ''we need to explicitly state the length

            'of the buffer we want

            'otherwise we'll get a string as long as ms.capacity

            'instead of the actual length of the string inside

            Dim iLen As Long = ms.Length

            Dim retval As String = _

                Encoding.UTF8.GetString(ms.GetBuffer(), 0, iLen)

 

            sr.Close()

            Return retval

 

        Catch ex As Exception

            Return ex.ToString()

 

        End Try

 

    End Function

 

    ''' <summary>

    '''     writes the resolved RSS feed to a file

    ''' </summary>

    Public Overloads Function WriteRSS() As String

        Dim writer As New StreamWriter(FileName, False, Encoding.UTF8)

        Return WriteRSS(writer, True)

    End Function

 

    ''' <summary>

    '''     Writes the resolved RSS feed to a text writer

    '''     and returns the text that was written (if it was written to a file)

    ''' </summary>

    Public Overloads Function WriteRSS(ByVal txWriter As TextWriter, ByVal closeAfterFinish As Boolean) As String

 

    End Function

 

    ''' <summary>

    '''     writes the beginning of the XML document

    ''' </summary>

    Private Sub WritePrologue(ByVal writer As XmlTextWriter)

        With writer

            .WriteStartDocument()

            .WriteComment("RSS generated by SiteToRSS generator at " + DateTime.Now.ToString("r"))

            .WriteStartElement("rss")

            .WriteAttributeString("version", "2.0")

            .WriteAttributeString("xmlns:blogChannel", "http://backend.userland.com/blogChannelModule")

 

            .WriteStartElement("channel", "")

            .WriteElementString("title", RSSFeedName)

            .WriteElementString("link", RssFeedLink)

            .WriteElementString("description", RssFeedDescription)

            .WriteElementString("copyright", RssFeedCopyright)

            .WriteElementString("generator", "SiteParser RSS engine 1.0 by Roy Osherove")

        End With

    End Sub

 

 

    ''adds a post to the RSS feed

    Private Sub AddRssItem(ByVal writer As XmlTextWriter, ByVal title As String, ByVal link As String, ByVal description As String, ByVal pubDate As String, ByVal subject As String)

 

        writer.WriteStartElement("item")

        writer.WriteElementString("title", title)

        writer.WriteElementString("link", link)

 

        'write the description as CDATA because

        'it might contain invalid chars

        writer.WriteStartElement("description")

        writer.WriteCData(description)

        writer.WriteEndElement()

 

        writer.WriteElementString("category", subject)

        writer.WriteElementString("pubDate", pubDate)

        writer.WriteEndElement()

 

    End Sub

 

    ''' <summary>

    '''     generates a new regular expression

    '''     and retrieves the HTML from the web

    ''' </summary>

    Private Sub ParseHtml()

        m_FoundRegex = New Regex(RegexPattern)

        GetHtml()

 

    End Sub

 

 

    ''' <summary>

    '''     retrieves the web page from the web

    ''' </summary>

    Private Sub GetHtml()

    End Sub

 

    Public Property DownloadedHtml() As String

    End Property

 

    ''' <summary>

    '''     this prefix will be prepended to every news item link

    ''' </summary>

    Public Property LinksPrefix() As String

    End Property

 

    Public Property RegexPattern() As String

        Get

            Return m_strRegexPattern

        End Get

        Set(ByVal Value As String)

            'important!

            'We need to verify this or we won't have a viable feed

            VerifyPatternIsGood(Value)

            m_strRegexPattern = Value

        End Set

    End Property

 

    ''' <summary>

    '''     verify that the required group names appear

    '''     in the regular expression passed to the parsing engine

    ''' </summary>

    Private Sub VerifyPatternIsGood(ByVal pattern As String)

    End Sub

 

    ''' <summary>

    '''     uses a regex to determine whether a certain named group

    '''     exists within another regex string.

    '''     If not, an exception is thrown.

    ''' </summary>

    Private Sub VerifyPatternIsGood(ByVal pattern As String, ByVal NeededGroupName As String)

    End Sub

 

    Public Property RssFeedDescription() As String

    End Property

 

    Public Property RssFeedLink() As String

    End Property

 

    Public Property RSSFeedName() As String

    End Property

 

 

    Public Property RssFeedCopyright() As String

    End Property

 

End Class

 

 

 

 

The class itself is very simple to use. You instantiate it with a URL and possibly a file name, and then set its properties, which map directly to the feed’s properties. Several parts of the implementation need special attention:

 

Verifying the existence of capture groups in a pattern

 

When setting the “RegexPattern” property, the class runs an internal check to verify that the supplied regex contains all the group names that are expected in order to successfully write the RSS feed. To this end, it calls the “VerifyPatternIsGood()” method, which internally calls an overload of itself with each required group name. That overload actually runs a match on the expression using its own regular expression, to check that the passed group name is indeed present in the pattern text. Kind of like performing brain surgery on yourself…

Here is the code for these two methods.

 

 

    ''' <summary>

    '''     verify that the required group names appear

    '''     in the regular expression passed to the parsing engine

    ''' </summary>

    Private Sub VerifyPatternIsGood(ByVal pattern As String)

        Try

            VerifyPatternIsGood(pattern, "description")

            VerifyPatternIsGood(pattern, "title")

            VerifyPatternIsGood(pattern, "link")

            VerifyPatternIsGood(pattern, "category")

            VerifyPatternIsGood(pattern, "pubDate")

        Catch ex As Exception

            Throw ex

        End Try

    End Sub

 

    ''' <summary>

    '''     uses a regex to determine whether a certain named group

    '''     exists within another regex string.

    '''     If not, an exception is thrown.

    ''' </summary>

    Private Sub VerifyPatternIsGood(ByVal pattern As String, ByVal NeededGroupName As String)

        Dim VerifyRegex As String = "\?<" & NeededGroupName & ">"

 

        If Not Regex.Match(pattern, VerifyRegex).Success Then

            Throw New ArgumentException(NeededGroupName & " group missing from pattern")

        End If

 

    End Sub

 

 

 

Retrieving the site using the WebClient class

The class is responsible for retrieving the site’s HTML content from the web. To that end, it uses the WebClient class, which allows us to, oh so easily, download web pages, download or upload files, post requests and do lots of other cool stuff.

The method that does this work is the GetHtml() method:

 

    ''' <summary>

    '''     retrieves the web page form the web

    ''' </summary>

    Private Sub GetHtml()

        Try

            Dim req As New WebClient

            Dim reader As New StreamReader(req.OpenRead(UrlToParse))

            Me.DownloadedHtml = reader.ReadToEnd()

 

            reader.Close()

            req.Dispose()

 

        Catch oE As System.Exception

        End Try

    End Sub

 

 

Writing the RSS feed to either a file or an in-memory stream

We want our class to be able to either write the XML it creates to a file or return it as a string. Why would we want to build an in-memory string rather than a file? If we were to create an .aspx page that returns the XML, it is much easier to build the string in memory than to write the XML to a file. Writing to a file from ASP.NET not only requires more security permissions, but you also have to deal with situations where the file might be accessed concurrently from multiple sessions, a task which is more of a hassle than a blessing for our needs.
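
For instance, an .aspx page could simply push the in-memory string straight into the response. This is a hypothetical sketch; it assumes the RSSCreator class described above, with the regex pattern and feed properties set where the comment indicates:

'hypothetical code-behind for an .aspx page that returns the feed
Private Sub Page_Load(ByVal sender As Object, ByVal e As System.EventArgs) Handles MyBase.Load
    Dim rss As New RSSCreator.RSSCreator("http://www.dotnetwire.com")
    '...set RegexPattern, LinksPrefix and the various feed properties here...

    Response.ContentType = "text/xml"
    Response.Write(rss.GetRss())
End Sub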

So a web client will want to call the GetRss() method, which returns a string, while a WinForms client might want to write to a file, for which it can call the WriteRss() method. Internally, though, both methods defer to the same implementation: an overloaded WriteRss() method that accepts a TextWriter object to which it writes the XML. The difference is that GetRss() calls this method with a TextWriter that sits on top of a MemoryStream object, while WriteRss() calls it with a StreamWriter that writes directly to a file. Here’s the code for the generic WriteRss() method:

 

    ''' <summary>

    '''     Writes the resolved RSS feed to a text writer

    '''     and returns the text that was written (if it was written to a file)

    ''' </summary>

    Public Overloads Function WriteRSS(ByVal txWriter As TextWriter, ByVal closeAfterFinish As Boolean) As String

 

        ParseHtml()

        Dim found As MatchCollection = m_FoundRegex.Matches(DownloadedHtml)

        Dim tr As New StringWriter

 

        Dim writer As New XmlTextWriter(txWriter)

        writer.Formatting = Formatting.Indented

 

        WritePrologue(writer)

 

        'write the individual news items

        For Each aMatch As Match In found

            Dim link As String = LinksPrefix & aMatch.Groups("link").Value

            Dim title As String = aMatch.Groups("title").Value

            Dim description As String = aMatch.Groups("description").Value

            Dim pubDate As String = DateTime.Parse(aMatch.Groups("pubDate").Value).ToString("r")

            Dim subject As String = aMatch.Groups("category").Value

 

            AddRssItem(writer, title.Trim(), link, description, pubDate, subject)

        Next

 

        ''finish all tags

        writer.WriteEndDocument()

        writer.Flush()

 

        Dim strResult As String = String.Empty

 

        If closeAfterFinish Then

            writer.Close()

 

            'return the result that was written

            'if this was written to a file

            Try

                Dim sr As StreamReader = File.OpenText(FileName)

                strResult = sr.ReadToEnd()

                sr.Close()

 

            Catch ex As Exception

            End Try

        End If

 

        Return strResult

    End Function

 

 

One thing of note in this method is the flag that tells it whether it should close the text writer after it finishes writing. This implementation detail is important because when writing to a memory stream, the calling method will want to keep the stream open after the call, so that it can retrieve the text inside that stream and return it as the result.

For the other function, which writes to a file, this is of no importance, so the flag is passed as True. The WriteRss() method then reads the XML file that was written and returns the resulting string.

 

Working with a MemoryStream and the case for XML encoding

Writing the XML directly to an in-memory string proved to be rather tricky. The XmlTextWriter has three constructors:

·         XmlTextWriter(fileName as String,encoding as System.Text.Encoding)

·         XmlTextWriter(w as System.IO.Stream, encoding as System.Text.Encoding)

·         XmlTextWriter(w as System.IO.TextWriter)

 

In our case, it is very important that we are able to specify the encoding manually. I’ll explain why.

When I first approached writing to an in-memory string using the XmlTextWriter, my initial instinct guided me towards creating a StringWriter object to pass to constructor #3 on our list. So a call would look something like this:

 

Dim sb As New StringBuilder()

Dim writer As New XmlTextWriter(New StringWriter(sb))

'write the xml using the writer
'...

writer.Close()

Return sb.ToString()

 

There’s a major problem with this code, although it seems to work perfectly. The problem is that internally, all .NET strings are represented as UTF-16. As a result, the StringBuilder and StringWriter objects produce XML whose declaration specifies UTF-16 encoding. This is a problem because, for our XML string to be parseable by RSS readers, it needs to be encoded as UTF-8. The constructor shown above gives us no way, not even after the declaration, to change the encoding the XML will be written with.
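
You can see this for yourself with a couple of lines (just a quick check, nothing more):

'needs: Imports System.IO
Dim sw As New StringWriter
'prints "utf-16" - not the utf-8 declaration an RSS reader expects
Console.WriteLine(sw.Encoding.WebName)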

 

So, to solve this, we are left with the other two constructors, which do allow us to specify the encoding of the output. The first constructor, which accepts a file name, needs no explanation; the second is the one we use to write to an in-memory string, and it is the one that caused the most trouble.

 

Here’s the “GetRss()” method again:

 

 

Public Overloads Function GetRss() As String

        Dim ms As New MemoryStream

        Dim sr As New StreamWriter(ms, Encoding.UTF8)

 

        'We send "false" to signal the method to not close the stream automatically in the end

        'we need to close the stream manually so we can get its length

        WriteRSS(sr, False)

        Try

 

            ''we need to explicitly state the length

            'of the buffer we want

            'otherwise we'll get a string as long as ms.capacity

            'instead of the actual length of the string inside

            Dim iLen As Long = ms.Length

            Dim retval As String = _

                Encoding.UTF8.GetString(ms.GetBuffer(), 0, iLen)

 

            sr.Close()

            Return retval

 

        Catch ex As Exception

            Return ex.ToString()

 

        End Try

 

    End Function

 

 

Because we can’t use a StringWriter object to send to WriteRss(), we’re left with the option of sending in a StreamWriter. We need to initialize it with a stream that is written in memory – a MemoryStream. The first part of the method is rather easy – we initialize the StreamWriter and send it over to WriteRss(). The trouble begins after that. How do you retrieve the text inside a MemoryStream? Well, conveniently enough, it has a GetBuffer() method, which returns the byte array of the stream’s contents. Easy enough; we now just need to transform this byte array into a string encoded as UTF-8. To that end I used the System.Text.Encoding.UTF8.GetString(byte()) method, which does exactly that.

System.Text.Encoding.UTF8.GetString() has two overloads:

·         Encoding.UTF8.GetString(array As Byte())

·         Encoding.UTF8.GetString(array As Byte(), index As Integer, count As Integer)

 

Why am I using the more complex version, then? It seems perfectly reasonable to call the first one, right? Wrong. When I used the first version of the method, my resulting XML string contained garbage at the end. It was mostly garbled and mangled spaces, which wreaked havoc when viewing the feed in IE, although it seemed to work fine in my aggregator. However, the feed failed to validate on www.FeedValidator.org. It said something about a “Missing token” and pointed at the last line of the XML. I couldn’t figure it out and was literally stuck. In fact, I didn’t even think of trying the other overload of this method until I got a little help from Mike Gunderloy. The problem was that the GetBuffer() method of the stream returns the stream’s entire internal buffer (up to the length of the MemoryStream.Capacity property). Since a MemoryStream allocates its buffer in chunks larger than it currently needs, the capacity of a stream is almost always larger than its current contents, so the resulting output is contents + garbage. Using the overloaded method, I could get the exact contents of the stream by specifying the stream’s length as the length of the output I wanted to extract from it. The Length property contains the actual length of the data.
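
The difference between the two is easy to demonstrate with a tiny stand-alone sketch:

'needs: Imports System.IO and Imports System.Text
Dim ms As New MemoryStream
Dim bytes As Byte() = Encoding.UTF8.GetBytes("<rss version=""2.0""></rss>")
ms.Write(bytes, 0, bytes.Length)

'the full internal buffer - typically 256 bytes here, mostly zeros
Console.WriteLine(ms.GetBuffer().Length)
'the actual data length - 25 bytes, exactly what was written
Console.WriteLine(ms.Length)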

This is also why I had to signal my WriteRss() implementation not to close the stream after it finishes writing. Had I closed the stream, I could not have read its length, which I needed in order to retrieve the results.

 

Link prefix

One more property in the class needs explaining. LinksPrefix contains the prefix that will be prepended to each discovered news item link. Notice, when harvesting the HTML, that the links in the site are usually not “full” links but “partial” ones, pointing to some place within the same site. In cases such as these (and .NetWire is one of them) we want to set LinksPrefix to http://www.DotNetWire.com, to make the news item links “full” again. For example, “/redirect.asp?newsid=4974” becomes “http://www.DotNetWire.com/redirect.asp?newsid=4974”.

 

RFC date formats

RSS 2.0 requires the publish date to be formatted as an RFC 822 date. We are using the RFC 1123 format (the “r” format specifier), which returns essentially the same result.
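
The “r” format specifier does that work for us. A one-line check, using the date from our sample HTML (the parse itself depends on the machine’s culture settings):

Console.WriteLine(DateTime.Parse("Sep 16, 2003").ToString("r"))
'prints something like: Tue, 16 Sep 2003 00:00:00 GMT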

Using the generic class with .NetWire

Ah, it’s ready. Let’s use it!

Here’s some simple code that uses the class to parse .NetWire and return an RSS XML feed from it:

 

  Dim rss As RSSCreator.RSSCreator = _

New RSSCreator.RSSCreator("http://www.dotnetwire.com")

 

            With rss

                .LinksPrefix = rss.UrlToParse

                .RegexPattern = "<p\s*class=""clsNormalText""><a\shref=""(?<link>.*)?(""\s*target=""newwindow"")(.|\n)*?>(?<title>.*\n?.*)?(</a><br>\s*\n*)(?<description>(.|\n)*?)(<br>(.|\n)*?>)(?<category>.*)?\.\s*(?<pubDate>.*)?(\.</span>)"

 

                .RSSFeedName = "The unofficial .NetWire RSS feed"

                .RssFeedLink = "http://www.DotNetWire.com"

                .RssFeedDescription = "A basic feed that parses the .NetWire site"

                .RssFeedCopyright = "Copyright 2003 Roy Osherove"

 

                Return .GetRss()

            End With
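
A WinForms client would do much the same thing, only using the constructor that also takes a file name and calling WriteRSS() instead. A sketch following the same pattern (REGEX_PATTERN is assumed to hold the expression shown above, and the file name is made up):

Dim rss As New RSSCreator.RSSCreator("http://www.dotnetwire.com", "dotnetwire.xml")

With rss
    .LinksPrefix = rss.UrlToParse
    .RegexPattern = REGEX_PATTERN
    .RSSFeedName = "The unofficial .NetWire RSS feed"
    .RssFeedLink = "http://www.DotNetWire.com"

    'writes the feed to dotnetwire.xml and returns the written text
    Dim rssXml As String = .WriteRSS()
End With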

 

What’s in the download?

The download contains several projects:

·         RssCreator : A library with the source for the RSSCreator class

·         MakeRss: A simple ASP.Net project that retrieves a feed for the .NetWire site

·         SiteToRss: a simple WinForms utility for testing various sites and regular expressions

 

 

Download the source files.
