ASPAlliance.com : The #1 Active Server Pages .NET Community The #1 ASP.NET Community
Search   Search

Subscribe   Subscribe

Powered by ORCSWeb Hosting


Site Stats


Powered By ASP.NET
 
Featured Sponsor

Featured Columnist


Featured Book
Professional ASP.NET
Professional ASP.NET

Find Prices
Read Review
Sample Chapter


New! asp.netPRO

We publish our articles in the standard RSS format.

Powerful .NET Email Component

Code Sharing Software
Andy's ASP Tips 'n' Tricks
Back To Articles
Home grown RSS Harvester & Aggregator
With the abundance and diversity of RSS News feeds out in the wild now thanks to sites like Moreover.com, the attraction of getting free syndicated news onto my site was getting too hard to resist. Unfortunately like most other small sites out on the net I am hosting on a shared server so I dont have access to all the scheduling tools and there is no chance of getting any COM objects installed, so I had to think round this problem.

As Im getting a steady stream of hits throughout the day I decided to place the harvester on the default page of my site then get it to run roughly every 4 hours using the Harvested field in the RSSHarvested table from the database

First of all behind everything is a small Database with the following 2 tables

RSSHarvested
ID Harvested
1 28/05/02 08:38:13


RssLocation
ID RssURL Name
1 http://www.total-metal.co.uk/dm2/tmrss.asp TotalMetal


These are the core of the Harvester controlling when it should run and the location of the RSS to be harvested

The main issue with the Aggregator is speed. So I decided it would be better to physically download the RSS files onto the server using the MSXML2.ServerXMLHTTP Object.

Part 1. The Harvester
First of all create the connection to the database and declare the Constants for the File System Object

set conn = server.createobject("ADODB.Connection")
connStr = "DRIVER={SQL Server}; Server=MYSERVER; UID=MYUID; PWD=MYPWD"
Conn.Open connStr

Const fsoForReading = 1
Const fsoForWriting = 2
Const fsoForAppend = 8


I then created a function that grabbed the XML from the web and returned it as a string
Function GetXML(strURL)
     Dim objXMLHTTP, strReturn
     Set objXMLHTTP = Server.CreateObject("MSXML2.ServerXMLHTTP")

     On Error Resume Next
     objXMLHTTP.Open "GET", strURL, false
     call objXMLHTTP.Send()

     GetXML = objXMLHTTP.ResponseText

      Set objXMLHTTP = Nothing
End Function


Next we have to check when the Harvester last ran
I have hard coded the processor to run roughly every 4 hours using the DateDiff function
Set rsTimeCheck = server.CreateObject("ADODB.Recordset")

with rsTimeCheck
     .CursorType = 2
     .CursorLocation = 2
     .LockType = 2
     .ActiveConnection = conn
     .Source = "SELECT Harvested FROM RSSHarvested"
     .Open
end with

StrDiffHours = datediff("h", rsTimeCheck("Harvested") , now())

if Cint(StrDiffHours) => 4 then


If the difference between Harvested and now() is greater than 4 then we need to run the code to download the RSS File onto the web server.
First of all we need to get the locations of the RSS files and the names we want to call the downloaded files from the database
Set rs = server.createobject("ADODB.Recordset")
with rs
     .cursortype = 3
     .cursorlocation = 3
     .locktype = 3
     .activeconnection = conn
     .Source = "Select RssUrl, Name from RssLocation"
     .open
end with


I also decided to remove the DOCTYPE from the RSS file for ease of processing once they have been saved onto my server. I did this by using Regular Expressions.
Set objRegExp = New Regexp

objRegExp.IgnoreCase = True
objRegExp.Global = True


I created a Pattern for the regular expression to find the DOCTYPE info ready to remove it later on in the program
objRegExp.Pattern = "<!DOCTYPE(.|\n)+?>"


Before we begin to download the RSS we need to create an instance of the File System Object
Set objFSO = Server.CreateObject("Scripting.FileSystemObject")


Now we are ready to loop through our record set and begin to download the RSS files
do while not rs.eof
     on error resume next


Call the GetXML function we created earlier and make sure the output is in a string then replace the pattern we defined in the Regular expression with nothing.
TextStr = cStr(GetXML(rs("RssUrl")))

TextStr = objRegExp.Replace(TextStr, "")


Now we have the RSS file in a string ready to save it onto the Hard Drive of the server.
We now need to define the paths of the existing file and somewhere to save the new RSS File. I created a temp directory to save these files in so we can compare the new file against the existing file for differences later on in the program
FilePath = "D:\WebDrive\MyDir\rss\" & rs("Name") & ".xml"

TempPath = "D:\WebDrive\MyDir\rss\temp\" & rs("Name") & ".xml"


Now we have the paths we need to save the new file into the temp directory then open the 2 files into a string using the TextStream Object
Set ObjTextStream = objFSO.CreateTextFile(TempPath, true, false)
ObjTextStream.writeline TextStr
ObjTextStream.close

Set objTextStream1 = objFSO.OpenTextFile(FilePath, fsoForReading)
StrOriginalFile = objTextStream1.ReadAll
ObjTextstream1.close

Set objTextStream2 = objFSO.OpenTextFile(TempPath, fsoForReading)
StrTempFile = objTextStream2.ReadAll
ObjTextStream2.close


Now we have the 2 files in strings we can compare them for differences using the StrComp function. If they are different then copy the temp file over the existing file then remove the temp file as it is not needed any more
if StrComp(StrOriginalFile,StrTempFile) <> 0 then
     objFSO.CopyFile TempPath, "D:\WebDrive\MyDir\rss\"
end if

ObjFSO.DeleteFile TempPath, False


Next to close the loop and Objects
rs.movenext
loop

ObjFSO.Close

rs.close
set rs = nothing


Finally we need to change the date / time the files were updated in the database
Set rsUpd = server.CreateObject("ADODB.Recordset")
with rsUPD
     ActiveConnection = conn
     .CursorType = 2
     .LockType = 2
     .Source = "SELECT Harvested FROM RSSHarvested"
     .Open
          rsUPD("Harvested") = now()
     .update
     .close
end with


Now this is complete all we have left to do is to close the IF statement and any objects we have left open
end if

Set objRegExp = Nothing

rsTimeCheck.Close
set rsTimeCheck = Nothing

conn.close
set conn = nothing


Hey presto. Freshly harvested RSS files on your server ready for manipulation in Part 2

Below is the complete code for the Harvester (default.asp)

set conn = server.createobject("ADODB.Connection")
connStr = "DRIVER={SQL Server}; Server=MYSERVER; UID=MYUID; PWD=MYPWD"
Conn.Open connStr

Const fsoForReading = 1
Const fsoForWriting = 2
Const fsoForAppend = 8

Function GetXML(strURL)
      Dim objXMLHTTP, strReturn
      Set objXMLHTTP = Server.CreateObject("MSXML2.ServerXMLHTTP")

     On Error Resume Next
     objXMLHTTP.Open "GET", strURL, false

     call objXMLHTTP.Send()
     GetXML = objXMLHTTP.ResponseText
     Set objXMLHTTP = Nothing
End Function

Set rsTimeCheck = server.CreateObject("ADODB.Recordset")
with rsTimeCheck
     .CursorType = 2
     .CursorLocation = 2
     .LockType = 2
     .ActiveConnection = conn
     .Source = "SELECT Harvested FROM RSSHarvested"
     .Open
end with

StrDiffHours = datediff("h", rsTimeCheck("Harvested") , now())

if Cint(StrDiffHours) => 4 then

     Set rs = server.createobject("ADODB.Recordset")
     with rs
          .cursortype = 3
          .cursorlocation = 3
          .locktype = 3
          .activeconnection = conn
          .Source = "Select RssUrl, Name from RssLocation"
          .open
     end with

     Set objRegExp = New Regexp
     objRegExp.IgnoreCase = True
     objRegExp.Global = True
     objRegExp.Pattern = "<!DOCTYPE(.|\n)+?>"
     Set objFSO = Server.CreateObject("Scripting.FileSystemObject")

     do while not rs.eof
          on error resume next

          TextStr = cStr(GetXML(rs("RssUrl")))
          TextStr = objRegExp.Replace(TextStr, "")

          FilePath = "D:\WebDrive\MyDir\rss\" & rs("Name") & ".xml"
          TempPath = "D:\WebDrive\MyDir\rss\temp\" & rs("Name") & ".xml"

          Set ObjTextStream = objFSO.CreateTextFile(TempPath, true, false)
          ObjTextStream.writeline TextStr
          ObjTextStream.close

          Set objTextStream1 = objFSO.OpenTextFile(FilePath, fsoForReading)
          StrOriginalFile = objTextStream1.ReadAll
          ObjTextstream1.close

          Set objTextStream2 = objFSO.OpenTextFile(TempPath, fsoForReading)
          StrTempFile = objTextStream2.ReadAll
          ObjTextStream2.close

          StrCompare = StrComp(StrOriginalFile,StrTempFile)

          if StrCompare <> 0 then
               objFSO.CopyFile TempPath, "D:\WebDrive\MyDir\rss\"
          end if

          ObjFSO.DeleteFile TempPath, False

     rs.movenext
     loop

     ObjFSO.Close
     rs.close
     set rs = nothing

     Set rsUpd = server.CreateObject("ADODB.Recordset")
     with rsUPD
          .ActiveConnection = conn
          .CursorType = 2
          .LockType = 2
          .Source = "SELECT Harvested FROM RSSHarvested"
          .Open
               rsUPD("Harvested") = now()
          .update
          .close
     end with

end if

Set objRegExp = Nothing

rsTimeCheck.Close
set rsTimeCheck = Nothing

conn.close
set conn = nothing


Part 2 The Aggrigator
Now we have freshly harvested RSS files on our server we need to aggregate and output them for the users to view
I have a version of this running on my home page (http://www.total-metal.co.uk) in the Other News (external sites) section at the bottom of the page

First of all we need to open the connection to the server and create a record set containing the names of the RSS files on our server
set conn = server.createobject("ADODB.Connection")
connStr = "DRIVER={SQL Server}; Server=MYSERVER; UID=MYUID; PWD=MYPWD"
Conn.Open connStr

set rs = server.createobject("ADODB.Recordset")
with rs
     .cursorType = 3
     .LockType = 3
     .CursorLocation = 3
     .ActiveConnection = conn
     .Source = "Select Name from RSSLocation"
     .open
end with


Next I decided that it would be easier to manipulate the RSS files if I loaded them into a Disconnected Record set containing
  • UID
  • LastBuildDate
  • DateDiff
  • Title
  • ItemTitle
  • ItemLink
Then sort them using ADO filters. In order to do this we need to define our data types for the Disconnected Record set then Create the Disconnected Record Set
Const adVarChar = 200
Const adInt = 4

Set DRS = server.CreateObject("ADODB.RECORDSET")

DRS.Fields.append "UID", adInt,4
DRS.Fields.append "LastBuildDate",adVarChar,200
DRS.Fields.append "DateDiff", adInt,4
DRS.Fields.append "Title", adVarChar,200
DRS.Fields.append "ItemTitle", adVarChar,200
DRS.Fields.append "ItemLink", adVarChar,200
DRS.open


Next we need to create an instance of the File system object and start a loop through the record set with a counter on it for our UID field in the Disconnected Record Set
set ObjFSO = server.CreateObject("Scripting.filesystemobject")

Counter = 1
do while not rs.eof
on error resume next


Next we need to define where our source file and the XSL style sheet are for when we transform the RSS later in the program
SourceFile = server.MapPath("rss/" & rs("name") & ".xml")
Stylefile = server.MapPath("rss/AggStyle.xsl")


Next we need to create an XML Object and an XSL Object using the Microsoft.XMLDOM
Set XML = server.CreateObject("Microsoft.XMLDOM")
XML.async = false
XML.load(SourceFile)

Set XSL = server.CreateObject("Microsoft.XMLDOM")
XSL.async = false
XSL.load(Stylefile)


Below is the XSL Style sheet in the XSL Object we will use to transform the RSS (AggStyle.xsl)
<?xml version="1.0"?>
<xsl:template xmlns:xsl="http://www.w3.org/TR/WD-xsl">
<xsl:value-of select="rss/channel/lastBuildDate"/>|<xsl:value-of select="rss/channel/title"/>|<xsl:value-of select="rss/channel/item/title"/>|<xsl:value-of select="rss/channel/item/link"/>
</xsl:template>


The Style sheet will return the data in the XML as a pipe ( | ) separated string
Now we have the XML and the XSL Objects we need to transform the XML against the XSL and split the pipe ( | ) separated string into an Array
ArrXML = split(XML.transformNode(XSL), "|")


Also we need to get the date the Source file was last modified for the DateDiff function we will be using later on
Set ObjFile = ObjFSO.GetFile(SourceFile)
StrDate = ObjFile.DateLastModified


Nows the time to add the contents of the Array into the Disconnected Record Set then Close the XML and XSL Objects
     DRS.AddNew
     DRS("UID") = Counter
     DRS("LastBuildDate") = StrDate
     DRS("DateDiff") = DateDiff("d", StrDate, date())
     DRS("Title") = ArrXML(1)
     DRS("ItemTitle") = ArrXML(2)
     DRS("ItemLink") = ArrXML(3)
     DRS.Update

     Set XML = Nothing
     Set XSL = Nothing


Now we need to increment the counter by 1, close the loop and close any objects we no longer need
Counter = Counter + 1
rs.MoveNext
loop

rs.close
set rs = nothing
set ObjFile = nothing
Set objFSO = nothing
Conn.close
Set Conn = nothing


Now we have populated the Disconnected Record Set we need to sort it and place some form of filtering on it. We achieve this by using ADO Filters and Sorts
The filter is used too weed out any bad records. There must be a title and the DateDiff has to be equal or above 0.
The Sort will order the records ascending. This is similar to the Order by clause in SQL
DRS.MoveFirst
DRS.Filter = "DateDIff >= 0 and Title <> ''"
DRS.Sort = "DateDiff"


Now it is almost time to display the aggregator we need to output some HTML
Response.Write "<table width=""75%"" border=""0"" cellspacing=""0"" ID="Table1">"


Now all that is left to do is loop through the Disconnected Record Set and display the aggregated RSS onto screen
Do while not DRS.EOF
%>
<tr>
     <td width="200" align="right" valign="top"><%=DRS("Title")%> - </td>
     <td> <a href="&lt;%=DRS("ItemLink")%>"><%=DRS("ItemTitle")%></a></td>
</tr>

<%DRS.MoveNext
Loop

DRS.Close
set DRS = Nothig
%>
</table>


Hey presto aggregated, Syndicated RSS news feeds fresh tasty and hot on your web page

Happy Programming
Andy Barker

Below is the complete code for the aggregator (aggregator.asp)
set conn = server.createobject("ADODB.Connection")
connStr = "DRIVER={SQL Server}; Server=MYSERVER; UID=MYUID; PWD=MYPWD"
Conn.Open connStr

set rs = server.createobject("ADODB.Recordset")
with rs
     .cursorType = 3
     .LockType = 3
     .CursorLocation = 3
     .ActiveConnection = conn
     .Source = "Select Name from RSSLocation"
     .open
end with

Const adVarChar = 200
Const adInt = 4

Set DRS = server.CreateObject("ADODB.RECORDSET")
DRS.Fields.append "UID", adInt,4
DRS.Fields.append "LastBuildDate",adVarChar,200
DRS.Fields.append "DateDiff", adInt,4
DRS.Fields.append "Title", adVarChar,200
DRS.Fields.append "ItemTitle", adVarChar,200
DRS.Fields.append "ItemLink", adVarChar,200
DRS.open

set ObjFSO = server.CreateObject("Scripting.filesystemobject")

Counter = 1
do while not rs.eof
     on error resume next
     SourceFile = server.MapPath("rss/" & rs("name") & ".xml")
     Stylefile = server.MapPath("rss/AggStyle.xsl")

     Set XML = server.CreateObject("Microsoft.XMLDOM")
     XML.async = false
     XML.load(SourceFile)

     Set XSL = server.CreateObject("Microsoft.XMLDOM")
     XSL.async = false
     XSL.load(Stylefile)

     Set ObjFile = ObjFSO.GetFile(SourceFile)
     StrDate = ObjFile.DateLastModified

     ArrXML = split(XML.transformNode(XSL), "|")
          DRS.AddNew
          DRS("UID") = Counter
          DRS("LastBuildDate") = StrDate
          DRS("DateDiff") = DateDiff("d", StrDate, date())
          DRS("Title") = ArrXML(1)
          DRS("ItemTitle") = ArrXML(2)
          DRS("ItemLink") = ArrXML(3)
          DRS.Update

     Set XML = Nothing
     Set XSL = Nothing
     Counter = Counter + 1
rs.MoveNext
loop

rs.close
set rs = nothing

set ObjFile = nothing
Set objFSO = nothing

DRS.MoveFirst
DRS.Filter = "DateDIff >= 0 and Title <> ''"
DRS.Sort = "DateDiff"

Response.Write "&lt;table width=""75%"" border=""0"" cellspacing=""0"">"

Do while not DRS.EOF
%>
<tr>
     <td width="200" align="right" valign="top"><%=DRS("Title")%> - </td>
     <td> <a href="&lt;%=DRS("ItemLink")%>"><%=DRS("ItemTitle")%></a></td>
</tr>

<%DRS.MoveNext
Loop

DRS.Close
set DRS = Nothig
%>
</table>
 Copyright © 2000-2003 ASPAlliance.com  Page Rendered at 12/3/2008 3:14:30 AM