Skip to content
This repository has been archived by the owner on Apr 4, 2021. It is now read-only.

goxpath needs a rewrite #6

Open
ChrisTrenkamp opened this issue Nov 4, 2016 · 3 comments
Open

goxpath needs a rewrite #6

ChrisTrenkamp opened this issue Nov 4, 2016 · 3 comments

Comments

@ChrisTrenkamp
Copy link
Owner

ChrisTrenkamp commented Nov 4, 2016

goxpath was mainly written for myself because the tools I used were either too complicated and cumbersome or insufficient. I made a lot of decisions early on that I now regret. Some of them were fixable, but others are inherit to its design and cannot be fixed without gutting huge portions of the library. I don't use goxpath as much as I was about half a year ago (and the commit history reflects that), but whenever I do, it has a bunch of flaws that bug the hell out of me. Here's a short list of ones that desperately need to be addressed:

tree's are not streamed

Anyone who has used goxpath on big XML files will notice this. My job is to work with many, many small XML files, so it was never really a concern, but as I start working with bigger XML files, this is starting to turn its ugly head.

The problem, however, is the core Go XML library does not have the ability to read previous elements, so if the tree automatically releases memory in the background, there's no way to get it back without re-opening the file.

tree's are format-agnostic, but the core library depends specifically on XML elements

I should have thought the core library out more thoroughly before putting in explicit XML dependencies. I should have made the data points on tree node's an interface so other format's can just define their own internal data points without making some ugly translation into XML.

tree's cannot be altered

It's becoming more apparent to me that people want this library for editing tree's, not just querying them. It can be done, but the default xmltree package was not made with editing in mind.

The public API is ugly

I focused a lot of the internal library and didn't pay much attention to the public API. It's not as fluent as I would like it to be, and it needs to be more elegant.

@suntong
Copy link

suntong commented Dec 4, 2016

subscribe.

@ChrisTrenkamp
Copy link
Owner Author

ChrisTrenkamp commented Jun 27, 2017

Been busy and still have no time to work on this, but finally came up with some realistic pseudocode that outlines some fixes to what I've complained about:

package main

import "github.com/ChrisTrenkamp/xsel"

type FooStruct struct {
    Foo string `xml:"string"`
}

func main() {
    xReader, _ := xsel.XmlReader(os.Open("..."))
    store, _ := xsel.InMemoryStore(xReader)

    for result := range xsel.Select("//foo", store) {
        f := &FooStruct{}
        _ := xsel.UnmarshalXml(f, result)
    }

    //Each of these returns an error as well, but it should get the point across
    var arrResult []xsel.Result = xsel.All(xsel.Select("//foo", store))
    var strResult string = xsel.AsString(xsel.Select("//foo", store))
    var numResult float64 = xsel.AsNum(xsel.Select("//foo", store))
}

First things first, it'll be a complete rewrite. The only thing that'll be transferred over is all the test cases. Let's call it xsel; it's easier to type than goxpath and has a bit of a catchier name (pretty sure I would fail marketing 101).

The first interface is the NodeReader (xReader), which will be equivalent to Java's STaX interface, in that it only reads the next node when it's told to, and does not store it in memory. A default implementation will be supplied for reading XML files (xsel.XmlReader), and drop-in replacements for JSON or anything can be implemented.

The second interface is the Storage. It will be responsible for appending and removing nodes from the tree. A default in-memory storage will be supplied, but drop-in replacements using a database-backed storage like BoltDB can be created.

When submitting an XPath query (xsel.Select), it'll return a struct channel with two fields: an error and a value. This, combined with the STaX-like puller, and a disk-backed storage, will allow for the memory-efficiency and streaming behaviour that goxpath has struggled to accomplish.

A common situation I kept running into was unmarshalling the XML into a struct, which, to be honest, is probably what 95% of people using XPath really need. There will be a default implementation for reading it into xml-tagged structs.

The last three function calls are simply helpers around the XPath query.

@ChrisTrenkamp
Copy link
Owner Author

Better 4 years late than never, I suppose:

https://github.com/ChrisTrenkamp/xsel

Here's a recap on what I wanted to accomplish:

tree's are not streamed

A streaming API was incredibly difficult to create, and difficult and awkward to use. This goal has been trashed.

tree's are format-agnostic, but the core library depends specifically on XML elements

xsel's core XPath logic is disconnected from any direct XML dependencies. XML tropes such as namespaces and processing instructions are defined in the package, but are not directly tied to any XML libraries and are not required.

tree's cannot be altered

xsel has two general interfaces, node's for defining the data points themselves, and cursor's for defining the parent-child relationships and position numbers. It mandates the order in which namespaces and attributes are created, and it mandates how position numbers are created, but nothing else beyond that. It should be possible to define your own interface for creating modifiable tree structures.

The public API is ugly

The following is an example for defining a custom function's in xsel. Should be easier to read and follow.

package main

import (
	"bytes"
	"fmt"

	"github.com/ChrisTrenkamp/xsel/exec"
	"github.com/ChrisTrenkamp/xsel/grammar"
	"github.com/ChrisTrenkamp/xsel/node"
	"github.com/ChrisTrenkamp/xsel/parser"
	"github.com/ChrisTrenkamp/xsel/store"
)

func main() {
	xml := `
<root>
	<a>This is an element.</a>
	<!-- This is a comment. -->
</root>
`

	isComment := func(context exec.Context, args ...exec.Result) (exec.Result, error) {
		nodeSet, isNodeSet := context.Result().(exec.NodeSet)

		if !isNodeSet || len(nodeSet) == 0 {
			return exec.Bool(false), nil
		}

		_, isComment := nodeSet[0].Node().(node.Comment)
		return exec.Bool(isComment), nil
	}

	contextSettings := func(c *exec.ContextSettings) {
		c.FunctionLibrary[exec.Name("", "is-comment")] = isComment
	}

	xpath := grammar.MustBuild(`//node()[is-comment()]`)
	parser := parser.ReadXml(bytes.NewBufferString(xml))
	cursor, _ := store.CreateInMemory(parser)
	result, _ := exec.Exec(cursor, &xpath, contextSettings)

	fmt.Println(result) // This is a comment.
}

There are also other improvements, such as separating the node information from the parent-child/position relationship's, making the API cleaner and easier to extend. It also uses a parser generator. It was a fun learning experience to hand-write my own lexer/parser, but it was also a mess and a nightmare to maintain.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants