Optimize selectors matching #510

Merged: 8 commits, Dec 28, 2023

ypconstante
Contributor

Apply some optimizations on selector matching to try to avoid expensive calls:

  • Use a list of separators instead of a regex in String.split (benchmark)
  • Don't try to match classes if the bit_size of the class attribute value is smaller than that of the first class in the selector
  • Reorder the selector classes to put the longest classes first, to make the check above more effective
  • Use equality when the selector has a single class and the class attribute value has the same bit_size as that class
  • Compute the minimal bit_size a class attribute value needs in order to match, and skip the split if that condition isn't met
  • In match?, check whether the selector requires descendants, and skip the other checks if the current node has none
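
The single-class shortcuts in the list above can be sketched roughly as follows. This is a minimal illustration, not Floki's actual code: the module name `SingleClassSketch` and the function `class_match?/2` are hypothetical, and `byte_size/1` is used here in place of the `bit_size/1` mentioned in the description.

```elixir
# Hypothetical module for illustration only; not part of Floki.
defmodule SingleClassSketch do
  @separators [" ", "\t", "\n"]

  # The attribute value is shorter than the class: no token in it can equal the class.
  def class_match?(attr, class) when byte_size(attr) < byte_size(class), do: false

  # Same size: the value can only match if it is exactly the class, so no split is needed.
  def class_match?(attr, class) when byte_size(attr) == byte_size(class), do: attr == class

  # Otherwise fall back to splitting the value into its class tokens.
  def class_match?(attr, class), do: class in String.split(attr, @separators, trim: true)
end

IO.inspect(SingleClassSketch.class_match?("btn", "btn"))
IO.inspect(SingleClassSketch.class_match?("btn primary", "btn"))
IO.inspect(SingleClassSketch.class_match?("button", "btn"))
```

The first two clauses are cheap size comparisons, so the more expensive split only runs when the attribute value is strictly longer than the class.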
##### With input child #####
Name                 ips        average  deviation         median         99th %
bench (pr)        103.12        9.70 ms    ±18.72%        9.61 ms       14.42 ms
bench              81.41       12.28 ms    ±18.12%       12.18 ms       18.23 ms

Comparison: 
bench (pr)        103.12
bench              81.41 - 1.27x slower +2.59 ms

Memory usage statistics:

Name          Memory usage
bench (pr)        10.83 MB
bench             11.63 MB - 1.07x memory usage +0.80 MB

**All measurements for memory usage were the same**

##### With input descendant #####
Name                 ips        average  deviation         median         99th %
bench (pr)        115.90        8.63 ms    ±18.97%        7.92 ms       13.61 ms
bench              91.88       10.88 ms    ±17.60%       10.12 ms       16.85 ms

Comparison: 
bench (pr)        115.90
bench              91.88 - 1.26x slower +2.26 ms

Memory usage statistics:

Name          Memory usage
bench (pr)        10.81 MB
bench             11.55 MB - 1.07x memory usage +0.73 MB

**All measurements for memory usage were the same**

##### With input multiple classes #####
Name                 ips        average  deviation         median         99th %
bench (pr)        138.85        7.20 ms    ±15.04%        6.70 ms       10.50 ms
bench              96.60       10.35 ms    ±15.46%        9.54 ms       15.73 ms

Comparison: 
bench (pr)        138.85
bench              96.60 - 1.44x slower +3.15 ms

Memory usage statistics:

Name          Memory usage
bench (pr)        10.85 MB
bench             11.54 MB - 1.06x memory usage +0.69 MB

**All measurements for memory usage were the same**

##### With input single class #####
Name                 ips        average  deviation         median         99th %
bench (pr)        127.96        7.81 ms    ±14.07%        7.28 ms       11.16 ms
bench              88.56       11.29 ms    ±17.78%       10.79 ms       16.49 ms

Comparison: 
bench (pr)        127.96
bench              88.56 - 1.44x slower +3.48 ms

Memory usage statistics:

Name          Memory usage
bench (pr)        10.81 MB
bench             11.54 MB - 1.07x memory usage +0.74 MB

read_file = fn name ->
  __ENV__.file
  |> Path.dirname()
  |> Path.join(name)
  |> File.read!()
end

html_input = read_file.("medium.html")

[{"html", _, _} = html | _] = Floki.parse_document!(html_input)

Benchee.run(
  %{
    "bench" => fn selector -> Floki.Finder.find(html, selector) end
  },
  time: 10,
  memory_time: 2,
  inputs: %{
    "single class" => ".mw-highlight",
    "multiple classes" => ".citation.pressrelease",
    "child" => ".citation > .external > *",
    "descendant" => ".external *"
  }
)

Owner

philss left a comment

Nice optimizations! 🚀

Let's see if we can solve that issue with the whitespace :)

end

defp do_classes_matches?(class_attr_value, [class]) do
  Enum.member?(String.split(class_attr_value, [" ", "\t", "\n"]), class)
Owner

I think this is not exactly the same as the regex version, because sequences of multiple whitespace characters are treated differently.

For example, if a class attribute contains multiple spaces between class values, the former version handles that correctly, while the new version does not:

class_attr_value = "my class\tanother   one"

# new
String.split(class_attr_value, [" ", "\t", "\n"])
#=> ["my", "class", "another", "", "", "one"]

# old
String.split(class_attr_value, ~r/\s+/)
#=> ["my", "class", "another", "one"]

We can, however, reject the empty binaries from the split result. Can you try this and see if there is an impact on performance?

Contributor Author

Done in fa64b01.
Adding trim: true had no measurable performance impact in the benchmark, but it did reduce memory usage, since the empty strings no longer end up in the list.
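
As a quick check of the fix discussed in this thread, the snippet below runs the reviewer's example input through the list-based split with `trim: true` and compares it with the regex version. The variable names are illustrative. The comparison only covers this input: `\s` also matches characters such as carriage return and form feed, which the three-element separator list does not include.

```elixir
separators = [" ", "\t", "\n"]
class_attr_value = "my class\tanother   one"

# List-based split; trim: true drops the empty strings produced by repeated separators.
with_trim = String.split(class_attr_value, separators, trim: true)

# Original regex-based split, for comparison.
with_regex = String.split(class_attr_value, ~r/\s+/)

IO.inspect(with_trim)
IO.inspect(with_trim == with_regex)
```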

- classes -- String.split(class_attr_value, ~r/\s+/) == []
+ min_size = Enum.reduce(classes, -1, fn item, acc -> acc + 1 + bit_size(item) end)
+ can_match? = bit_size(class_attr_value) >= min_size
+ can_match? && classes -- String.split(class_attr_value, [" ", "\t", "\n"]) == []
Owner

Same thing here :)
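
To make the size pre-check in the hunk above concrete, here is a small worked example using the same reduction on one of the benchmark selectors. The sample attribute values are made up for illustration. Note that the reduction adds 1 per separator while class sizes are counted in bits, so the bound is slightly looser than the 8 bits a separator byte occupies; as far as I can tell it still never rejects a value that could match.

```elixir
classes = ["citation", "pressrelease"]

# Same reduction as in the hunk: class sizes in bits, plus 1 for each separator.
# "citation" is 8 bytes (64 bits) and "pressrelease" is 12 bytes (96 bits).
min_size = Enum.reduce(classes, -1, fn item, acc -> acc + 1 + bit_size(item) end)

# Illustrative attribute values (not taken from the benchmark input).
too_short = "citation"
long_enough = "citation pressrelease"

IO.inspect(min_size)
# 64 bits: below the bound, so the split can be skipped entirely.
IO.inspect(bit_size(too_short) >= min_size)
# 168 bits: at or above the bound, so the split still has to run.
IO.inspect(bit_size(long_enough) >= min_size)
```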

@@ -63,7 +63,7 @@ defmodule Floki.Selector.AttributeSelector do
  s.attribute
  |> get_value(attributes)
  # Splits by whitespaces ("a b c" -> ["a", "b", "c"])
- |> String.split(~r/\s+/)
+ |> String.split([" ", "\t", "\n"])
Owner

Here as well. I think we need to consider the same problems here.

@@ -103,7 +103,7 @@ defmodule Floki.Selector.AttributeSelector do

  def match?(attributes, s = %AttributeSelector{match_type: :includes, value: value}) do
  get_value(s.attribute, attributes)
- |> String.split(~r/\s+/)
+ |> String.split([" ", "\t", "\n"])
Owner

And here :)
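
For the `:includes` match type, the same whitespace concern can be illustrated with a standalone sketch. The module `IncludesSketch` and its `includes?/3` function are hypothetical and not Floki's API; the sketch assumes attributes are given as a list of `{name, value}` tuples.

```elixir
# Hypothetical helper for illustration only; not Floki's AttributeSelector.
defmodule IncludesSketch do
  @separators [" ", "\t", "\n"]

  # Returns true when `value` is one of the whitespace-separated tokens
  # of the attribute called `name`.
  def includes?(attributes, name, value) do
    case List.keyfind(attributes, name, 0) do
      {^name, attr_value} ->
        # With trim: true the token list contains no empty strings,
        # so repeated separators cannot make an empty value match.
        value in String.split(attr_value, @separators, trim: true)

      nil ->
        false
    end
  end
end

attrs = [{"rel", "nofollow   external"}]

IO.inspect(IncludesSketch.includes?(attrs, "rel", "external"))
IO.inspect(IncludesSketch.includes?(attrs, "rel", ""))
```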

philss merged commit 2ae5e2a into philss:main on Dec 28, 2023
9 checks passed

philss commented Dec 28, 2023

@ypconstante thank you!

ypconstante deleted the optimize-selectors branch on December 28, 2023 01:52