HTML encoding issues - "Â" character showing up instead of " "
I've got a legacy app just starting to misbehave, for whatever reason I'm not sure. It generates a bunch of HTML that gets turned into PDF reports by ActivePDF.
The process works like this:
- Pull an HTML template from a DB with tokens in it to be replaced (e.g. "~CompanyName~", "~CustomerName~", etc.)
- Replace the tokens with real data
- Tidy the HTML with a simple regex function that property formats HTML tag attribute values (ensures quotation marks, etc, since ActivePDF's rendering engine hates anything but single quotes around attribute values)
- Send off the HTML to a web service that creates the PDF.
Somewhere in that mess, the non-breaking spaces from the HTML template (the
s) are encoding as ISO-8859-1 so that they show up incorrectly as an "Â" character when viewing the document in a browser (FireFox). ActivePDF pukes on these non-UTF8 characters.
My question: since I don't know where the problem stems from and don't have time to investigate it, is there an easy way to re-encode or find-and-replace the bad characters? I've tried sending it through this little function I threw together, but it turns it all into gobbledegook doesn't change anything.
Private Shared Function ConvertToUTF8(ByVal html As String) As String
Dim isoEncoding As Encoding = Encoding.GetEncoding("iso-8859-1")
Dim source As Byte() = isoEncoding.GetBytes(html)
Return Encoding.UTF8.GetString(Encoding.Convert(isoEncoding, Encoding.UTF8, source))
End Function
Any ideas?
EDIT:
I'm getting by with this for now, though it hardly seems like a good solution:
Private Shared Function ReplaceNonASCIIChars(ByVal html As String) As String
Return Regex.Replace(html, "[^u0000-u007F]", " ")
End Function