|
Home grown RSS Harvester & Aggregator
With
the abundance and diversity of RSS News feeds out in the wild now thanks to
sites like Moreover.com, the attraction of getting free syndicated news onto my
site was getting too hard to resist. Unfortunately like most other small sites
out on the net I am hosting on a shared server so I dont have access to all
the scheduling tools and there is no chance of getting any COM objects
installed, so I had to think round this problem.
As Im getting a steady stream of hits throughout the day I decided to place
the harvester on the default page of my site then get it to run roughly every 4
hours using the Harvested field in the RSSHarvested table from the database
First of all behind everything is a small Database with the following 2 tables
| RSSHarvested |
| ID |
Harvested |
| 1 |
28/05/02 08:38:13 |
| RssLocation
|
| ID |
RssURL |
Name |
| 1 |
http://www.total-metal.co.uk/dm2/tmrss.asp |
TotalMetal |
These are the core of the Harvester controlling when it should run and the
location of the RSS to be harvested
The main issue with the Aggregator is speed. So I decided it would be better to
physically download the RSS files onto the server using the MSXML2.ServerXMLHTTP
Object.
Part 1. The Harvester
First of all create the connection to the database and declare the Constants
for the File System Object
set conn =
server.createobject("ADODB.Connection")
connStr = "DRIVER={SQL Server}; Server=MYSERVER; UID=MYUID; PWD=MYPWD"
Conn.Open connStr
Const fsoForReading = 1
Const fsoForWriting = 2 Const fsoForAppend = 8
|
I then created a function that grabbed the XML from the web and returned it as
a string
Function GetXML(strURL)
Dim objXMLHTTP, strReturn
Set objXMLHTTP =
Server.CreateObject("MSXML2.ServerXMLHTTP")
On Error Resume Next
objXMLHTTP.Open "GET", strURL, false
call objXMLHTTP.Send()
GetXML = objXMLHTTP.ResponseText
Set objXMLHTTP = Nothing End Function
|
Next we have to check when the Harvester last ran
I have hard coded the processor to run roughly every 4 hours using the DateDiff
function
Set rsTimeCheck =
server.CreateObject("ADODB.Recordset")
with rsTimeCheck
.CursorType = 2
.CursorLocation = 2
.LockType = 2
.ActiveConnection = conn
.Source = "SELECT Harvested FROM RSSHarvested"
.Open
end with
StrDiffHours = datediff("h", rsTimeCheck("Harvested") , now())
if Cint(StrDiffHours) => 4 then
|
If the difference between Harvested and now() is greater than 4 then we need to
run the code to download the RSS File onto the web server.
First of all we need to get the locations of the RSS files and the names we
want to call the downloaded files from the database
Set rs =
server.createobject("ADODB.Recordset")
with rs
.cursortype = 3
.cursorlocation = 3
.locktype = 3
.activeconnection = conn
.Source = "Select RssUrl, Name from RssLocation"
.open end with
|
I also decided to remove the DOCTYPE from the RSS file for ease of processing
once they have been saved onto my server. I did this by using Regular
Expressions.
Set objRegExp = New Regexp
objRegExp.IgnoreCase = True objRegExp.Global = True
|
I created a Pattern for the regular expression to find the DOCTYPE info ready
to remove it later on in the program
| objRegExp.Pattern =
"<!DOCTYPE(.|\n)+?>"
|
Before we begin to download the RSS we need to create an instance of the File
System Object
| Set objFSO =
Server.CreateObject("Scripting.FileSystemObject")
|
Now we are ready to loop through our record set and begin to download the RSS
files
do while not rs.eof on
error resume next
|
Call the GetXML function we created earlier and make sure the output is in a
string then replace the pattern we defined in the Regular expression with
nothing.
TextStr =
cStr(GetXML(rs("RssUrl")))
TextStr = objRegExp.Replace(TextStr, "")
|
Now we have the RSS file in a string ready to save it onto the Hard Drive of
the server.
We now need to define the paths of the existing file and somewhere to save the
new RSS File. I created a temp directory to save these files in so we can
compare the new file against the existing file for differences later on in the
program
FilePath =
"D:\WebDrive\MyDir\rss\" & rs("Name") & ".xml"
TempPath = "D:\WebDrive\MyDir\rss\temp\" & rs("Name") & ".xml"
|
Now we have the paths we need to save the new file into the temp directory then
open the 2 files into a string using the TextStream Object
Set ObjTextStream =
objFSO.CreateTextFile(TempPath, true, false)
ObjTextStream.writeline TextStr
ObjTextStream.close
Set objTextStream1 = objFSO.OpenTextFile(FilePath, fsoForReading)
StrOriginalFile = objTextStream1.ReadAll
ObjTextstream1.close
Set objTextStream2 = objFSO.OpenTextFile(TempPath, fsoForReading)
StrTempFile = objTextStream2.ReadAll ObjTextStream2.close
|
Now we have the 2 files in strings we can compare them for differences using
the StrComp function. If they are different then copy the temp file over the
existing file then remove the temp file as it is not needed any more
if
StrComp(StrOriginalFile,StrTempFile) <> 0 then
objFSO.CopyFile TempPath,
"D:\WebDrive\MyDir\rss\"
end if
ObjFSO.DeleteFile TempPath, False
|
Next to close the loop and Objects
rs.movenext
loop
ObjFSO.Close
rs.close set rs = nothing
|
Finally we need to change the date / time the files were updated in the
database
Set rsUpd =
server.CreateObject("ADODB.Recordset")
with rsUPD
ActiveConnection = conn
.CursorType = 2
.LockType = 2
.Source = "SELECT Harvested FROM RSSHarvested"
.Open
rsUPD("Harvested")
= now()
.update
.close end with
|
Now this is complete all we have left to do is to close the IF statement and
any objects we have left open
end if
Set objRegExp = Nothing
rsTimeCheck.Close
set rsTimeCheck = Nothing
conn.close set conn = nothing
|
Hey presto. Freshly harvested RSS files on your server ready for manipulation
in Part 2
Below is the complete code for the Harvester (default.asp)
set conn =
server.createobject("ADODB.Connection")
connStr = "DRIVER={SQL Server}; Server=MYSERVER; UID=MYUID; PWD=MYPWD"
Conn.Open connStr
Const fsoForReading = 1
Const fsoForWriting = 2
Const fsoForAppend = 8
Function GetXML(strURL)
Dim objXMLHTTP, strReturn
Set objXMLHTTP =
Server.CreateObject("MSXML2.ServerXMLHTTP")
On Error Resume Next
objXMLHTTP.Open "GET", strURL, false
call objXMLHTTP.Send()
GetXML = objXMLHTTP.ResponseText
Set objXMLHTTP = Nothing
End Function
Set rsTimeCheck = server.CreateObject("ADODB.Recordset")
with rsTimeCheck
.CursorType = 2
.CursorLocation = 2
.LockType = 2
.ActiveConnection = conn
.Source = "SELECT Harvested FROM RSSHarvested"
.Open
end with
StrDiffHours = datediff("h", rsTimeCheck("Harvested") , now())
if Cint(StrDiffHours) => 4 then
Set rs = server.createobject("ADODB.Recordset")
with rs
.cursortype = 3
.cursorlocation = 3
.locktype = 3
.activeconnection =
conn
.Source = "Select
RssUrl, Name from RssLocation"
.open
end with
Set objRegExp = New Regexp
objRegExp.IgnoreCase = True
objRegExp.Global = True
objRegExp.Pattern = "<!DOCTYPE(.|\n)+?>"
Set objFSO =
Server.CreateObject("Scripting.FileSystemObject")
do while not rs.eof
on error resume
next
TextStr =
cStr(GetXML(rs("RssUrl")))
TextStr =
objRegExp.Replace(TextStr, "")
FilePath =
"D:\WebDrive\MyDir\rss\" & rs("Name") & ".xml"
TempPath =
"D:\WebDrive\MyDir\rss\temp\" & rs("Name") & ".xml"
Set ObjTextStream =
objFSO.CreateTextFile(TempPath, true, false)
ObjTextStream.writeline
TextStr
ObjTextStream.close
Set objTextStream1
= objFSO.OpenTextFile(FilePath, fsoForReading)
StrOriginalFile =
objTextStream1.ReadAll
ObjTextstream1.close
Set objTextStream2
= objFSO.OpenTextFile(TempPath, fsoForReading)
StrTempFile =
objTextStream2.ReadAll
ObjTextStream2.close
StrCompare =
StrComp(StrOriginalFile,StrTempFile)
if StrCompare <> 0
then
objFSO.CopyFile
TempPath, "D:\WebDrive\MyDir\rss\"
end if
ObjFSO.DeleteFile
TempPath, False
rs.movenext
loop
ObjFSO.Close
rs.close
set rs = nothing
Set rsUpd =
server.CreateObject("ADODB.Recordset")
with rsUPD
.ActiveConnection =
conn
.CursorType = 2
.LockType = 2
.Source = "SELECT
Harvested FROM RSSHarvested"
.Open
rsUPD("Harvested")
= now()
.update
.close
end with
end if
Set objRegExp = Nothing
rsTimeCheck.Close
set rsTimeCheck = Nothing
conn.close set conn = nothing
|
Part 2 The Aggrigator
Now we have freshly harvested RSS files on our server we need to aggregate and
output them for the users to view
I have a version of this running on my home page (http://www.total-metal.co.uk)
in the Other News (external sites) section at the bottom of the page
First of all we need to open the connection to the server and create a record
set containing the names of the RSS files on our server
set conn =
server.createobject("ADODB.Connection")
connStr = "DRIVER={SQL Server}; Server=MYSERVER; UID=MYUID; PWD=MYPWD"
Conn.Open connStr
set rs = server.createobject("ADODB.Recordset")
with rs
.cursorType = 3
.LockType = 3
.CursorLocation = 3
.ActiveConnection = conn
.Source = "Select Name from RSSLocation"
.open
end with
|
Next I decided that it would be easier to manipulate the RSS files if I loaded
them into a Disconnected Record set containing
-
UID
-
LastBuildDate
-
DateDiff
-
Title
-
ItemTitle
-
ItemLink
Then sort them using ADO filters. In order to do this we need to define our
data types for the Disconnected Record set then Create the Disconnected Record
Set
Const adVarChar = 200
Const adInt = 4
Set DRS = server.CreateObject("ADODB.RECORDSET")
DRS.Fields.append "UID", adInt,4
DRS.Fields.append "LastBuildDate",adVarChar,200
DRS.Fields.append "DateDiff", adInt,4
DRS.Fields.append "Title", adVarChar,200
DRS.Fields.append "ItemTitle", adVarChar,200
DRS.Fields.append "ItemLink", adVarChar,200 DRS.open
|
Next we need to create an instance of the File system object and start a loop
through the record set with a counter on it for our UID field in the
Disconnected Record Set
set ObjFSO =
server.CreateObject("Scripting.filesystemobject")
Counter = 1
do while not rs.eof
on error resume next
|
Next we need to define where our source file and the XSL style sheet are for
when we transform the RSS later in the program
SourceFile =
server.MapPath("rss/" & rs("name") & ".xml") Stylefile =
server.MapPath("rss/AggStyle.xsl")
|
Next we need to create an XML Object and an XSL Object using the
Microsoft.XMLDOM
Set XML =
server.CreateObject("Microsoft.XMLDOM")
XML.async = false
XML.load(SourceFile)
Set XSL = server.CreateObject("Microsoft.XMLDOM")
XSL.async = false
XSL.load(Stylefile)
|
Below is the XSL Style sheet in the XSL Object we will use to transform the RSS
(AggStyle.xsl)
<?xml version="1.0"?>
<xsl:template xmlns:xsl="http://www.w3.org/TR/WD-xsl">
<xsl:value-of select="rss/channel/lastBuildDate"/>|<xsl:value-of
select="rss/channel/title"/>|<xsl:value-of
select="rss/channel/item/title"/>|<xsl:value-of
select="rss/channel/item/link"/> </xsl:template>
|
The Style sheet will return the data in the XML as a pipe ( | ) separated
string
Now we have the XML and the XSL Objects we need to transform the XML against
the XSL and split the pipe ( | ) separated string into an Array
| ArrXML =
split(XML.transformNode(XSL), "|")
|
Also we need to get the date the Source file was last modified for the DateDiff
function we will be using later on
Set ObjFile =
ObjFSO.GetFile(SourceFile) StrDate = ObjFile.DateLastModified
|
Nows the time to add the contents of the Array into the Disconnected Record
Set then Close the XML and XSL Objects
DRS.AddNew
DRS("UID") = Counter
DRS("LastBuildDate") = StrDate
DRS("DateDiff") = DateDiff("d", StrDate, date())
DRS("Title") = ArrXML(1)
DRS("ItemTitle") = ArrXML(2)
DRS("ItemLink") = ArrXML(3)
DRS.Update
Set XML = Nothing Set
XSL = Nothing
|
Now we need to increment the counter by 1, close the loop and close any objects
we no longer need
Counter = Counter + 1
rs.MoveNext
loop
rs.close
set rs = nothing
set ObjFile = nothing
Set objFSO = nothing
Conn.close
Set Conn = nothing
|
Now we have populated the Disconnected Record Set we need to sort it and place
some form of filtering on it. We achieve this by using ADO Filters and Sorts
The filter is used too weed out any bad records. There must be a title and the
DateDiff has to be equal or above 0.
The Sort will order the records ascending. This is similar to the Order by
clause in SQL
DRS.MoveFirst
DRS.Filter = "DateDIff >= 0 and Title <> ''" DRS.Sort = "DateDiff"
|
Now it is almost time to display the aggregator we need to output some HTML
| Response.Write "<table
width=""75%"" border=""0"" cellspacing=""0"" ID="Table1">"
|
Now all that is left to do is loop through the Disconnected Record Set and
display the aggregated RSS onto screen
Do while not DRS.EOF
%>
<tr>
<td width="200" align="right"
valign="top"><%=DRS("Title")%> - </td>
<td> <a href="<%=DRS("ItemLink")%>"><%=DRS("ItemTitle")%></a></td>
</tr>
<%DRS.MoveNext
Loop
DRS.Close
set DRS = Nothig
%> </table>
|
Hey presto aggregated, Syndicated RSS news feeds fresh tasty and hot on your
web page
Happy Programming
Andy Barker
Below is the complete code for the aggregator (aggregator.asp)
set conn =
server.createobject("ADODB.Connection")
connStr = "DRIVER={SQL Server}; Server=MYSERVER; UID=MYUID; PWD=MYPWD"
Conn.Open connStr
set rs = server.createobject("ADODB.Recordset")
with rs
.cursorType = 3
.LockType = 3
.CursorLocation = 3
.ActiveConnection = conn
.Source = "Select Name from RSSLocation"
.open
end with
Const adVarChar = 200
Const adInt = 4
Set DRS = server.CreateObject("ADODB.RECORDSET")
DRS.Fields.append "UID", adInt,4
DRS.Fields.append "LastBuildDate",adVarChar,200
DRS.Fields.append "DateDiff", adInt,4
DRS.Fields.append "Title", adVarChar,200
DRS.Fields.append "ItemTitle", adVarChar,200
DRS.Fields.append "ItemLink", adVarChar,200
DRS.open
set ObjFSO = server.CreateObject("Scripting.filesystemobject")
Counter = 1
do while not rs.eof
on error resume next
SourceFile = server.MapPath("rss/" & rs("name") &
".xml")
Stylefile = server.MapPath("rss/AggStyle.xsl")
Set XML = server.CreateObject("Microsoft.XMLDOM")
XML.async = false
XML.load(SourceFile)
Set XSL = server.CreateObject("Microsoft.XMLDOM")
XSL.async = false
XSL.load(Stylefile)
Set ObjFile = ObjFSO.GetFile(SourceFile)
StrDate = ObjFile.DateLastModified
ArrXML = split(XML.transformNode(XSL), "|")
DRS.AddNew
DRS("UID") =
Counter
DRS("LastBuildDate")
= StrDate
DRS("DateDiff") =
DateDiff("d", StrDate, date())
DRS("Title") =
ArrXML(1)
DRS("ItemTitle") =
ArrXML(2)
DRS("ItemLink") =
ArrXML(3)
DRS.Update
Set XML = Nothing
Set XSL = Nothing
Counter = Counter + 1
rs.MoveNext
loop
rs.close
set rs = nothing
set ObjFile = nothing
Set objFSO = nothing
DRS.MoveFirst
DRS.Filter = "DateDIff >= 0 and Title <> ''"
DRS.Sort = "DateDiff"
Response.Write "<table width=""75%"" border=""0"" cellspacing=""0"">"
Do while not DRS.EOF
%>
<tr>
<td width="200" align="right"
valign="top"><%=DRS("Title")%> - </td>
<td> <a href="<%=DRS("ItemLink")%>"><%=DRS("ItemTitle")%></a></td>
</tr>
<%DRS.MoveNext
Loop
DRS.Close
set DRS = Nothig
%> </table>
|
Copyright © 2000-2003 ASPAlliance.com Page Rendered at
12/3/2008 3:14:30 AM
|
| |
|