Extract Text From HTML

By Steven Smith

In the course of improving this website's search engine, I wrote a routine that extracts the text from an article given its URL, strips out the HTML, and then converts all of the white space and carriage returns into single spaces. This was done to compress the text, which was then stored in the database and used for full-text searches. To strip all of the HTML tags from the document, I used regular expressions (with some help from Remas).
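The strip-and-compress step described above can be sketched roughly as follows. This is a minimal illustration in Python, not the site's actual code: the function name `extract_text` is hypothetical, the tag pattern is the same `<[^>]*>` used by the routines in this article, and fetching the page from a URL is left out.

```python
import re

TAG_PATTERN = re.compile(r"<[^>]*>")      # same tag pattern as the article
WHITESPACE_PATTERN = re.compile(r"\s+")   # runs of spaces, tabs, CR/LF

def extract_text(html):
    """Strip tags, then collapse every whitespace run to a single space."""
    # Replace each tag with a space so adjacent words do not fuse together.
    without_tags = TAG_PATTERN.sub(" ", html)
    # Collapse whitespace (including carriage returns) and trim the ends.
    return WHITESPACE_PATTERN.sub(" ", without_tags).strip()

if __name__ == "__main__":
    sample = "<h1>Title</h1>\r\n<p>Some   <b>bold</b>\ttext.</p>"
    print(extract_text(sample))  # prints "Title Some bold text."
```

The result is one line of space-separated words, which is the compact form that suits storage for full-text search.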

My code was written using ASP and VBScript (version 5.5 for RegExp support), but I'll show how it can easily be done in ASP.NET.

First, let's look at the source code of the ASP function:

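Function RemoveHTML(ByVal strText)
	' ByVal keeps the caller's variable from being modified
	Dim RegEx
	Set RegEx = New RegExp
	RegEx.Pattern = "<[^>]*>"   ' matches any single HTML tag
	RegEx.Global = True         ' replace every match, not just the first
	' Turn <br> into a line feed; LCase makes the match case-insensitive
	strText = Replace(LCase(strText), "<br>", Chr(10))
	RemoveHTML = RegEx.Replace(strText, "")
End Function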

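Note: This function returns all lower-case output. If you want to maintain the case of your content, remove the LCase call and use four different replaces, one each for <br>, <Br>, <bR>, and <BR>. Alternatively, VBScript's Replace accepts a compare argument, so a single call such as Replace(strText, "<br>", Chr(10), 1, -1, vbTextCompare) covers all four spellings.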
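As an illustration of the case-preserving variant the note describes, here is a small Python sketch (not the author's VBScript; the function name `remove_html_keep_case` is hypothetical). It matches `<br>` case-insensitively instead of lower-casing the whole input, then strips the remaining tags with the same pattern as above. Like the original, it only recognizes the literal `<br>` form, not `<br />`.

```python
import re

BR_PATTERN = re.compile(r"<br>", re.IGNORECASE)  # <br>, <Br>, <bR>, <BR>
TAG_PATTERN = re.compile(r"<[^>]*>")             # any remaining tag

def remove_html_keep_case(text):
    """Convert <br> to a line feed and strip other tags, keeping letter case."""
    text = BR_PATTERN.sub("\n", text)   # "\n" is chr(10)
    return TAG_PATTERN.sub("", text)

if __name__ == "__main__":
    print(repr(remove_html_keep_case("Hello<BR>World <b>Bold</b>")))
```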

OK, now let's see how it would be done in ASP.NET. Just to make this article more interesting, I'll list the code in all three standard languages of .NET: VB, C#, and JScript. The VB version is shown below.

<%@ Import Namespace="System.IO" %>
<%@ Import Namespace="System.Text.RegularExpressions" %>
<script language="VB" runat="server">

Sub SubmitBtn_Click(sender As Object, e As EventArgs)
    Dim strInput As String
    Dim strOutput As String
    strInput = Text1.Text
    ' Replace every HTML tag with a single space
    strOutput = Regex.Replace(strInput, "<[^>]*>", " ")
    output.Text = strOutput
    ' Show the original input with its markup encoded for display
    output_raw.Text = Server.HtmlEncode(Text1.Text)
End Sub

</script>
<html>
<body>
<a href="/stevesmith/articles/removehtml.asp">Return To Article</a>
<form runat="server">
<table width="100%">
<tr>
    <td valign="top" rowspan="2">
    Add HTML Formatted Text<br>
    <asp:TextBox TextMode="multiline" id="Text1" width="200px"
    height="80px" runat="server" /><br>
    <asp:Button OnClick="SubmitBtn_Click" Text="Format Text" Runat="server"/>
    </td>
    <td valign="top">
    Unformatted Text:
    </td>
    <td valign="top">
    <pre><asp:label id="output_raw" runat="server" /></pre>
    </td>
</tr>
<tr>
    <td valign="top">
    HTML-stripped Output:
    </td>
    <td valign="top">
    <pre><asp:label id="output" runat="server" /></pre>
    </td>
</tr>
</table>
</form>
</body>
</html>

The full source of the example is shown. You can run the example and see how it works.
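One detail worth noticing when comparing the two versions: the classic ASP function replaces each tag with an empty string, while the ASP.NET handler replaces each tag with a space. The following Python sketch (an illustration only; `strip_tags` is a hypothetical helper) shows how the two choices differ when tags sit directly between words.

```python
import re

TAG_PATTERN = re.compile(r"<[^>]*>")  # same tag pattern as both versions

def strip_tags(html, replacement):
    """Replace every tag in html with the given replacement string."""
    return TAG_PATTERN.sub(replacement, html)

if __name__ == "__main__":
    sample = "<p>one</p><p>two</p>"
    print(repr(strip_tags(sample, "")))   # prints 'onetwo'
    print(repr(strip_tags(sample, " ")))  # prints ' one  two '
```

Replacing with a space keeps neighboring words apart, at the cost of extra spaces that a later whitespace-collapsing pass can remove.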

Steven Smith, MCSE + Internet (4.0)
Last Modified: 8/8/2001 9:19:45 AM
History: 1/25/2004 6:10:06 PM