
Handling Encoding Issues When Parsing XML Documents in Go

March 10, 2018 · Software Engineering · 1 min read
Tecker Yu
AI Native Cloud Engineer × Part-time Investor

I've recently been working on RSS parsing, so here is a note on how to parse XML documents that declare a non-UTF-8 encoding. The key is to set CharsetReader on an xml.Decoder, as this test shows:

package rss_test

import (
    "bytes"
    "encoding/xml"
    "fmt"
    "io"
    "testing"

    "github.com/yujiahaol68/rossy/rss"
    "golang.org/x/net/html/charset"
)

func Test_notUTF8(t *testing.T) {
    r := rss.New()

    // Note: xml.Unmarshal() cannot be used here. It gives no way to set a
    // CharsetReader, so it fails on documents that declare a non-UTF-8 encoding.
    d := xml.NewDecoder(bytes.NewReader([]byte(notUTF8rss)))
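    // Convert the declared encoding to UTF-8. The decoder passes the label from
    // the XML declaration, so look it up with charset.NewReaderLabel;
    // charset.NewReader expects a content type such as "text/html; charset=gbk".
    // UTF-8 documents are unaffected: the decoder never calls CharsetReader for them.
    d.CharsetReader = func(label string, input io.Reader) (io.Reader, error) {
        return charset.NewReaderLabel(label, input)
    }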
    err := d.Decode(r)

    if err != nil {
        t.Fatal(err)
    }

    for _, item := range r.ItemList {
        fmt.Printf("* %s\n%s\n", item.Title, item.Link)
    }
}
