Optimize selectors matching #510

ypconstante · 2023-12-22T18:28:34Z

Apply some optimizations to selector matching to try to avoid expensive calls:

- Use a list of patterns instead of a regex for `String.split` (benchmark)
- Don't try to match a class if the attribute value's `bit_size` is smaller than that of the first class in the selector
- Reorder selector classes to put the longest classes first, to optimize the check above
- Use equality when there's a single class selector and the class attribute value has the same `bit_size` as the selector
- Compute the minimal `bit_size` a class attribute value must have, and avoid the split if this condition isn't met
- In `match?`, check whether the selector requires a descendant, and skip the other checks if the current node doesn't have descendants
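The size check from the list above can be sketched as follows. This is a minimal illustration, not Floki's actual code: the module name `ClassMatchSketch` and the function `classes_match?/2` are hypothetical, and only the size expression is taken from the diff in this PR.

```elixir
defmodule ClassMatchSketch do
  # Hypothetical helper for illustration only; not part of Floki.

  # Returns true when every class in `classes` occurs in `class_attr_value`.
  def classes_match?(class_attr_value, classes) do
    # Lower bound on the attribute size: the bits of every class plus at
    # least one unit per separator between classes. The bound is loose
    # (a separator is really 8 bits), but it never rejects a valid match.
    min_size = Enum.reduce(classes, -1, fn class, acc -> acc + 1 + bit_size(class) end)

    # Cheap size comparison first; split only when a match is still possible.
    bit_size(class_attr_value) >= min_size and
      classes -- String.split(class_attr_value, [" ", "\t", "\n"], trim: true) == []
  end
end

# The attribute is long enough, so it is split and compared.
IO.inspect(ClassMatchSketch.classes_match?("citation pressrelease", ["citation", "pressrelease"]))

# The attribute is too short to hold both classes, so the split is skipped.
IO.inspect(ClassMatchSketch.classes_match?("nav", ["citation", "pressrelease"]))
```

The point of the guard is that `bit_size/1` is a constant-time check on a binary, while the split allocates a new list on every call.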

##### With input child #####
Name                 ips        average  deviation         median         99th %
bench (pr)        103.12        9.70 ms    ±18.72%        9.61 ms       14.42 ms
bench              81.41       12.28 ms    ±18.12%       12.18 ms       18.23 ms

Comparison: 
bench (pr)        103.12
bench              81.41 - 1.27x slower +2.59 ms

Memory usage statistics:

Name          Memory usage
bench (pr)        10.83 MB
bench             11.63 MB - 1.07x memory usage +0.80 MB

**All measurements for memory usage were the same**

##### With input descendant #####
Name                 ips        average  deviation         median         99th %
bench (pr)        115.90        8.63 ms    ±18.97%        7.92 ms       13.61 ms
bench              91.88       10.88 ms    ±17.60%       10.12 ms       16.85 ms

Comparison: 
bench (pr)        115.90
bench              91.88 - 1.26x slower +2.26 ms

Memory usage statistics:

Name          Memory usage
bench (pr)        10.81 MB
bench             11.55 MB - 1.07x memory usage +0.73 MB

**All measurements for memory usage were the same**

##### With input multiple classes #####
Name                 ips        average  deviation         median         99th %
bench (pr)        138.85        7.20 ms    ±15.04%        6.70 ms       10.50 ms
bench              96.60       10.35 ms    ±15.46%        9.54 ms       15.73 ms

Comparison: 
bench (pr)        138.85
bench              96.60 - 1.44x slower +3.15 ms

Memory usage statistics:

Name          Memory usage
bench (pr)        10.85 MB
bench             11.54 MB - 1.06x memory usage +0.69 MB

**All measurements for memory usage were the same**

##### With input single class #####
Name                 ips        average  deviation         median         99th %
bench (pr)        127.96        7.81 ms    ±14.07%        7.28 ms       11.16 ms
bench              88.56       11.29 ms    ±17.78%       10.79 ms       16.49 ms

Comparison: 
bench (pr)        127.96
bench              88.56 - 1.44x slower +3.48 ms

Memory usage statistics:

Name          Memory usage
bench (pr)        10.81 MB
bench             11.54 MB - 1.07x memory usage +0.74 MB

read_file = fn name ->
  __ENV__.file
  |> Path.dirname()
  |> Path.join(name)
  |> File.read!()
end

html_input = read_file.("medium.html")

[{"html", _, _} = html | _] = Floki.parse_document!(html_input)

Benchee.run(
  %{
    "bench" => fn selector -> Floki.Finder.find(html, selector) end
  },
  time: 10,
  memory_time: 2,
  inputs: %{
    "single class" => ".mw-highlight",
    "multiple classes" => ".citation.pressrelease",
    "child" => ".citation > .external > *",
    "descendant" => ".external *"
  }
)

philss

Nice optimizations! 🚀

Let's see if we can solve that issue with the whitespaces :)

philss · 2023-12-27T23:17:05Z

lib/floki/selector.ex

+  end
+
+  defp do_classes_matches?(class_attr_value, [class]) do
+    Enum.member?(String.split(class_attr_value, [" ", "\t", "\n"]), class)


I think this is not exactly the same as the regex version, because runs of multiple whitespace characters are treated differently.

For example, if we have a class attribute that contains multiple spaces between class values, the former version would handle that correctly, while the new version does not:

    class_attr_value = "my class\tanother   one"
    # new
    String.split(class_attr_value, [" ", "\t", "\n"])
    #=> ["my", "class", "another", "", "", "one"]
    # old
    String.split(class_attr_value, ~r/\s+/)
    #=> ["my", "class", "another", "one"]

We can, however, reject the empty binaries from the match. Can you try this and see if there is an impact in performance?

Done, fa64b01.
Adding `trim: true` didn't have a performance impact in the benchmark, but it did reduce memory usage, since we no longer have these empty strings in the list.
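The three variants discussed in this comment can be compared side by side. A small sketch; the input string is chosen for illustration, with three spaces between the last two words:

```elixir
value = "my class\tanother   one"

# Regex version: a run of whitespace counts as a single separator.
regex_split = String.split(value, ~r/\s+/)

# List-of-patterns version: every whitespace character is its own
# separator, so consecutive spaces produce empty strings.
list_split = String.split(value, [" ", "\t", "\n"])

# With trim: true the empty strings are removed from the result.
trimmed_split = String.split(value, [" ", "\t", "\n"], trim: true)

IO.inspect(regex_split, label: "regex")
IO.inspect(list_split, label: "list")
IO.inspect(trimmed_split, label: "list + trim")
```

For this input the regex and the trimmed list produce the same result. One remaining difference, as far as I can tell, is that `\s` also matches characters such as carriage return, which the three-pattern list does not split on.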

philss · 2023-12-27T23:17:26Z

lib/floki/selector.ex

-    classes -- String.split(class_attr_value, ~r/\s+/) == []
+    min_size = Enum.reduce(classes, -1, fn item, acc -> acc + 1 + bit_size(item) end)
+    can_match? = bit_size(class_attr_value) >= min_size
+    can_match? && classes -- String.split(class_attr_value, [" ", "\t", "\n"]) == []


Same thing here :)

philss · 2023-12-27T23:17:47Z

lib/floki/selector/attribute_selector.ex

@@ -63,7 +63,7 @@ defmodule Floki.Selector.AttributeSelector do
    s.attribute
    |> get_value(attributes)
    # Splits by whitespaces ("a  b c" -> ["a", "b", "c"])
-    |> String.split(~r/\s+/)
+    |> String.split([" ", "\t", "\n"])


Here as well. I think we need to consider the same problems here.

philss · 2023-12-27T23:17:56Z

lib/floki/selector/attribute_selector.ex

@@ -103,7 +103,7 @@ defmodule Floki.Selector.AttributeSelector do

  def match?(attributes, s = %AttributeSelector{match_type: :includes, value: value}) do
    get_value(s.attribute, attributes)
-    |> String.split(~r/\s+/)
+    |> String.split([" ", "\t", "\n"])


And here :)

lib/floki/selector/parser.ex

philss · 2023-12-28T00:39:21Z

@ypconstante thank you!

ypconstante added 5 commits December 22, 2023 08:13

Use list instead of regex for string split

341ed1d

Skip class_attr_value split if size smaller than first class

e5858ff

Skip class_attr_value split when same size as single class

d03143d

Compute minimal size to try to skip class_attr_value split

9ae289b

Reorder selector classes to improve performance

99e6a32

ypconstante mentioned this pull request Dec 22, 2023

Use list instead of regex for string split #509

Closed

Skip when selector requires children and node doesn't have one

75f9507

ypconstante force-pushed the optimize-selectors branch from 26b07ff to 75f9507 Compare December 22, 2023 18:29

philss reviewed Dec 27, 2023

ypconstante added 2 commits December 27, 2023 20:56

Remove empty values on split

fa64b01

Add comment to optimize_selector

e4a8a8a

philss approved these changes Dec 28, 2023

philss merged commit 2ae5e2a into philss:main Dec 28, 2023
9 checks passed

ypconstante deleted the optimize-selectors branch December 28, 2023 01:52
