-
Notifications
You must be signed in to change notification settings - Fork 238
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Non-Cascaded #[inline]
Propagation
#678
Comments
Thanks for your research!
Yes, this is because this function implemented using a macro which is actually expands into a relatively long code. I never profiled library yet, so maybe all your suggestions are correct, but because I want to rewrite parser I think it is prematurally to optimize inlines now. Currently reading and parsing parts are very coupled which I think is not very good for readability and performance. Indirectly, this theory is confirmed by the fact that by rewriting the code in a new manner, I later discovered that the same approach is used in maybe_xml, which is now often 1.5-2x times faster than quick_xml. Thanks for a dataset. Would be very grateful if you also share the code you used to do comparisons and draw graphs so we can get back to this topic after rewriting. About calamine performance: if it uses serde-based parsing than it is significantly slow that working with raw events. I think it is because when you read events in a cycle, it is better for caches, then when you read one event and then process it, read another one and process it, etc. I have an idea to improve serde performance by buffering events first. As a conclusion: I definitely want to look and improve performance and create the best XML library in Rust, it is just the next step after fixing some bugs which are coupled with current implementation which I hope will be fixed after rewriting. |
Inlining is also a bit of a tricky form of optimization in general. If you bloat the code size too much, larger programs will get slower rather than faster. Debugging & profiling also gets more difficult, etc. Certainly worth it in many cases, though. I agree w/ Mingun about taking a deeper look once the rewrites are done. |
The code I changed was: // reader/buffered_reader.rs
#[inline(always)]
$($async)? fn read_bytes_until $(<$lf>)? (
&mut self,
byte: u8,
buf: &'b mut Vec<u8>,
position: &mut usize,
) -> Result<Option<&'b [u8]>> {...}
#[inline(always)]
$($async)? fn read_element $(<$lf>)? (
&mut self,
buf: &'b mut Vec<u8>,
position: &mut usize,
) -> Result<Option<&'b [u8]>> {...}
#[inline(always)]
pub fn read_event_into<'b>(&mut self, buf: &'b mut Vec<u8>) -> Result<Event<'b>> {...}
#[inline(always)]
pub fn read_to_end_into(&mut self, end: QName, buf: &mut Vec<u8>) -> Result<Span> {...} // reader/mod.rs
#[inline(always)]
pub fn buffer_position(&self) -> usize {...}
#[inline(always)]
fn read_event_impl<'i, B>(&mut self, mut buf: B) -> Result<Event<'i>>
where
R: XmlSource<'i, B>,
{...}
#[inline(always)]
fn read_until_open<'i, B>(&mut self, buf: B) -> Result<std::result::Result<Event<'i>, B>>
where
R: XmlSource<'i, B>,
{...}
#[inline(always)]
fn read_until_close<'i, B>(&mut self, buf: B) -> Result<Event<'i>>
where
R: XmlSource<'i, B>,
{...} // reader/state.rs
#[inline(always)]
pub fn emit_text<'b>(&mut self, bytes: &'b [u8]) -> Result<Event<'b>> {...}
#[inline(always)]
pub fn emit_bang<'b>(&mut self, bang_type: BangType, buf: &'b [u8]) -> Result<Event<'b>> {...}
#[inline(always)]
pub fn emit_end<'b>(&mut self, buf: &'b [u8]) -> Result<Event<'b>> {...}
#[inline(always)]
pub fn emit_question_mark<'b>(&mut self, buf: &'b [u8]) -> Result<Event<'b>> {...}
#[inline(always)]
pub fn emit_start<'b>(&mut self, content: &'b [u8]) -> Result<Event<'b>> {...}
#[inline(always)]
pub fn close_expanded_empty(&mut self) -> Result<Event<'static>> {...} This should be all of it. It was a pretty quick and dirty patch job, but the main idea was to create an end-to-end link with For the plots, I just used a import plotly.express as px
import polars as pl
df = pl.read_csv('CSV')
# requires that `master` time is in the first row
diff = df.item(column='time', row=0)
df = df.with_columns(
percentage_over_master = pl.lit((df.get_column('time') / diff - 1) * -1)
)
fig = px.bar(
x=df['implementation'],
y=df['time'],
template='plotly_dark',
text=df['time'],
height=800,
labels={'y': 'seconds', 'x': 'implementation', 'color': 'implementation'},
color=df['implementation'],
title='quick_xml `#[inline]` Comparison: Runtime'
)
fig.update_traces(textposition='outside', texttemplate = "%{text}s",)
fig.show()
fig = px.bar(
x=df['implementation'],
y=df['percentage_over_master'],
text=df['percentage_over_master'],
template='plotly_dark',
height=800,
labels={'y': 'percentage', 'x': 'implementation', 'color': 'implementation'},
color=df['implementation'],
text_auto='.4p',
title='quick_xml `#[inline]` Comparison: Percentage Over Master'
)
fig.update_traces(textposition='outside')
fig.update_yaxes(showticklabels=False)
fig.show() The
I used |
Related: tafia#678 All methods called only once or two and inlining them in most cases increases performance of our benchmarks: > critcmp master element-parser -t 5 group element-parser master ----- -------------- ------ NsReader::read_resolved_event_into/trim_text = false 1.00 398.9±6.30µs ? ?/sec 1.05 419.6±7.94µs ? ?/sec NsReader::read_resolved_event_into/trim_text = true 1.00 382.1±7.06µs ? ?/sec 1.06 404.0±7.44µs ? ?/sec One event/CData 1.00 56.3±0.97ns ? ?/sec 1.21 68.1±1.35ns ? ?/sec One event/Comment 1.00 141.2±2.52ns ? ?/sec 1.14 161.4±2.79ns ? ?/sec decode_and_parse_document_with_namespaces/rpm_filelists.xml 1.00 95.1±1.45µs 115.5 MB/sec 1.07 102.2±1.65µs 107.5 MB/sec escape_text/escaped_chars_long 1.42 1806.4±34.20ns ? ?/sec 1.00 1275.0±23.98ns ? ?/sec escape_text/escaped_chars_short 1.00 491.5±8.35ns ? ?/sec 1.07 526.6±10.80ns ? ?/sec escape_text/no_chars_to_escape_long 2.06 1831.1±36.31ns ? ?/sec 1.00 887.1±17.00ns ? ?/sec parse_document_nocopy_with_namespaces/libreoffice_document.fodt 1.00 507.2±8.56µs 107.6 MB/sec 1.08 546.2±10.20µs 100.0 MB/sec parse_document_nocopy_with_namespaces/rpm_filelists.xml 1.00 87.2±1.64µs 126.0 MB/sec 1.14 99.2±1.74µs 110.7 MB/sec parse_document_nocopy_with_namespaces/rpm_other.xml 1.00 139.6±2.83µs 158.5 MB/sec 1.07 148.7±2.71µs 148.9 MB/sec parse_document_nocopy_with_namespaces/rpm_primary.xml 1.00 190.5±3.43µs 106.4 MB/sec 1.09 207.9±3.79µs 97.5 MB/sec parse_document_nocopy_with_namespaces/rpm_primary2.xml 1.00 61.7±1.10µs 116.2 MB/sec 1.09 67.5±1.28µs 106.2 MB/sec parse_document_nocopy_with_namespaces/sample_1.xml 1.00 10.5±0.20µs 105.0 MB/sec 1.06 11.1±0.21µs 99.3 MB/sec parse_document_nocopy_with_namespaces/sample_ns.xml 1.00 8.4±0.16µs 86.5 MB/sec 1.08 9.0±0.18µs 80.0 MB/sec parse_document_nocopy_with_namespaces/sample_rss.xml 1.00 786.4±13.46µs 239.8 MB/sec 1.09 859.9±12.82µs 219.3 MB/sec parse_document_nocopy_with_namespaces/test_writer_ident.xml 1.00 29.0±0.55µs 146.4 MB/sec 1.06 30.8±0.55µs 138.0 MB/sec read_event/trim_text = false 1.00 199.3±3.59µs ? ?/sec 1.10 218.5±3.98µs ? ?/sec read_event/trim_text = true 1.00 190.4±3.76µs ? ?/sec 1.11 211.7±4.11µs ? ?/sec unescape_text/no_chars_to_unescape_short 1.00 11.8±0.21ns ? ?/sec 1.06 12.4±0.23ns ? ?/sec
Related: tafia#678 All methods called only once or two and inlining them in most cases increases performance of our benchmarks: > critcmp master element-parser -t 5 group element-parser master ----- -------------- ------ NsReader::read_resolved_event_into/trim_text = false 1.00 398.9±6.30µs ? ?/sec 1.05 419.6±7.94µs ? ?/sec NsReader::read_resolved_event_into/trim_text = true 1.00 382.1±7.06µs ? ?/sec 1.06 404.0±7.44µs ? ?/sec One event/CData 1.00 56.3±0.97ns ? ?/sec 1.21 68.1±1.35ns ? ?/sec One event/Comment 1.00 141.2±2.52ns ? ?/sec 1.14 161.4±2.79ns ? ?/sec decode_and_parse_document_with_namespaces/rpm_filelists.xml 1.00 95.1±1.45µs 115.5 MB/sec 1.07 102.2±1.65µs 107.5 MB/sec escape_text/escaped_chars_long 1.42 1806.4±34.20ns ? ?/sec 1.00 1275.0±23.98ns ? ?/sec escape_text/escaped_chars_short 1.00 491.5±8.35ns ? ?/sec 1.07 526.6±10.80ns ? ?/sec escape_text/no_chars_to_escape_long 2.06 1831.1±36.31ns ? ?/sec 1.00 887.1±17.00ns ? ?/sec parse_document_nocopy_with_namespaces/libreoffice_document.fodt 1.00 507.2±8.56µs 107.6 MB/sec 1.08 546.2±10.20µs 100.0 MB/sec parse_document_nocopy_with_namespaces/rpm_filelists.xml 1.00 87.2±1.64µs 126.0 MB/sec 1.14 99.2±1.74µs 110.7 MB/sec parse_document_nocopy_with_namespaces/rpm_other.xml 1.00 139.6±2.83µs 158.5 MB/sec 1.07 148.7±2.71µs 148.9 MB/sec parse_document_nocopy_with_namespaces/rpm_primary.xml 1.00 190.5±3.43µs 106.4 MB/sec 1.09 207.9±3.79µs 97.5 MB/sec parse_document_nocopy_with_namespaces/rpm_primary2.xml 1.00 61.7±1.10µs 116.2 MB/sec 1.09 67.5±1.28µs 106.2 MB/sec parse_document_nocopy_with_namespaces/sample_1.xml 1.00 10.5±0.20µs 105.0 MB/sec 1.06 11.1±0.21µs 99.3 MB/sec parse_document_nocopy_with_namespaces/sample_ns.xml 1.00 8.4±0.16µs 86.5 MB/sec 1.08 9.0±0.18µs 80.0 MB/sec parse_document_nocopy_with_namespaces/sample_rss.xml 1.00 786.4±13.46µs 239.8 MB/sec 1.09 859.9±12.82µs 219.3 MB/sec parse_document_nocopy_with_namespaces/test_writer_ident.xml 1.00 29.0±0.55µs 146.4 MB/sec 1.06 30.8±0.55µs 138.0 MB/sec read_event/trim_text = false 1.00 199.3±3.59µs ? ?/sec 1.10 218.5±3.98µs ? ?/sec read_event/trim_text = true 1.00 190.4±3.76µs ? ?/sec 1.11 211.7±4.11µs ? ?/sec unescape_text/no_chars_to_unescape_short 1.00 11.8±0.21ns ? ?/sec 1.06 12.4±0.23ns ? ?/sec
Related: tafia#678 All methods called only once or two and inlining them in most cases increases performance of our benchmarks: > critcmp master element-parser -t 5 group element-parser master ----- -------------- ------ NsReader::read_resolved_event_into/trim_text = false 1.00 398.9±6.30µs ? ?/sec 1.05 419.6±7.94µs ? ?/sec NsReader::read_resolved_event_into/trim_text = true 1.00 382.1±7.06µs ? ?/sec 1.06 404.0±7.44µs ? ?/sec One event/CData 1.00 56.3±0.97ns ? ?/sec 1.21 68.1±1.35ns ? ?/sec One event/Comment 1.00 141.2±2.52ns ? ?/sec 1.14 161.4±2.79ns ? ?/sec decode_and_parse_document_with_namespaces/rpm_filelists.xml 1.00 95.1±1.45µs 115.5 MB/sec 1.07 102.2±1.65µs 107.5 MB/sec escape_text/escaped_chars_long 1.42 1806.4±34.20ns ? ?/sec 1.00 1275.0±23.98ns ? ?/sec escape_text/escaped_chars_short 1.00 491.5±8.35ns ? ?/sec 1.07 526.6±10.80ns ? ?/sec escape_text/no_chars_to_escape_long 2.06 1831.1±36.31ns ? ?/sec 1.00 887.1±17.00ns ? ?/sec parse_document_nocopy_with_namespaces/libreoffice_document.fodt 1.00 507.2±8.56µs 107.6 MB/sec 1.08 546.2±10.20µs 100.0 MB/sec parse_document_nocopy_with_namespaces/rpm_filelists.xml 1.00 87.2±1.64µs 126.0 MB/sec 1.14 99.2±1.74µs 110.7 MB/sec parse_document_nocopy_with_namespaces/rpm_other.xml 1.00 139.6±2.83µs 158.5 MB/sec 1.07 148.7±2.71µs 148.9 MB/sec parse_document_nocopy_with_namespaces/rpm_primary.xml 1.00 190.5±3.43µs 106.4 MB/sec 1.09 207.9±3.79µs 97.5 MB/sec parse_document_nocopy_with_namespaces/rpm_primary2.xml 1.00 61.7±1.10µs 116.2 MB/sec 1.09 67.5±1.28µs 106.2 MB/sec parse_document_nocopy_with_namespaces/sample_1.xml 1.00 10.5±0.20µs 105.0 MB/sec 1.06 11.1±0.21µs 99.3 MB/sec parse_document_nocopy_with_namespaces/sample_ns.xml 1.00 8.4±0.16µs 86.5 MB/sec 1.08 9.0±0.18µs 80.0 MB/sec parse_document_nocopy_with_namespaces/sample_rss.xml 1.00 786.4±13.46µs 239.8 MB/sec 1.09 859.9±12.82µs 219.3 MB/sec parse_document_nocopy_with_namespaces/test_writer_ident.xml 1.00 29.0±0.55µs 146.4 MB/sec 1.06 30.8±0.55µs 138.0 MB/sec read_event/trim_text = false 1.00 199.3±3.59µs ? ?/sec 1.10 218.5±3.98µs ? ?/sec read_event/trim_text = true 1.00 190.4±3.76µs ? ?/sec 1.11 211.7±4.11µs ? ?/sec unescape_text/no_chars_to_unescape_short 1.00 11.8±0.21ns ? ?/sec 1.06 12.4±0.23ns ? ?/sec
Related: #678 All methods called only once or two and inlining them in most cases increases performance of our benchmarks: > critcmp master element-parser -t 5 group element-parser master ----- -------------- ------ NsReader::read_resolved_event_into/trim_text = false 1.00 398.9±6.30µs ? ?/sec 1.05 419.6±7.94µs ? ?/sec NsReader::read_resolved_event_into/trim_text = true 1.00 382.1±7.06µs ? ?/sec 1.06 404.0±7.44µs ? ?/sec One event/CData 1.00 56.3±0.97ns ? ?/sec 1.21 68.1±1.35ns ? ?/sec One event/Comment 1.00 141.2±2.52ns ? ?/sec 1.14 161.4±2.79ns ? ?/sec decode_and_parse_document_with_namespaces/rpm_filelists.xml 1.00 95.1±1.45µs 115.5 MB/sec 1.07 102.2±1.65µs 107.5 MB/sec escape_text/escaped_chars_long 1.42 1806.4±34.20ns ? ?/sec 1.00 1275.0±23.98ns ? ?/sec escape_text/escaped_chars_short 1.00 491.5±8.35ns ? ?/sec 1.07 526.6±10.80ns ? ?/sec escape_text/no_chars_to_escape_long 2.06 1831.1±36.31ns ? ?/sec 1.00 887.1±17.00ns ? ?/sec parse_document_nocopy_with_namespaces/libreoffice_document.fodt 1.00 507.2±8.56µs 107.6 MB/sec 1.08 546.2±10.20µs 100.0 MB/sec parse_document_nocopy_with_namespaces/rpm_filelists.xml 1.00 87.2±1.64µs 126.0 MB/sec 1.14 99.2±1.74µs 110.7 MB/sec parse_document_nocopy_with_namespaces/rpm_other.xml 1.00 139.6±2.83µs 158.5 MB/sec 1.07 148.7±2.71µs 148.9 MB/sec parse_document_nocopy_with_namespaces/rpm_primary.xml 1.00 190.5±3.43µs 106.4 MB/sec 1.09 207.9±3.79µs 97.5 MB/sec parse_document_nocopy_with_namespaces/rpm_primary2.xml 1.00 61.7±1.10µs 116.2 MB/sec 1.09 67.5±1.28µs 106.2 MB/sec parse_document_nocopy_with_namespaces/sample_1.xml 1.00 10.5±0.20µs 105.0 MB/sec 1.06 11.1±0.21µs 99.3 MB/sec parse_document_nocopy_with_namespaces/sample_ns.xml 1.00 8.4±0.16µs 86.5 MB/sec 1.08 9.0±0.18µs 80.0 MB/sec parse_document_nocopy_with_namespaces/sample_rss.xml 1.00 786.4±13.46µs 239.8 MB/sec 1.09 859.9±12.82µs 219.3 MB/sec parse_document_nocopy_with_namespaces/test_writer_ident.xml 1.00 29.0±0.55µs 146.4 MB/sec 1.06 30.8±0.55µs 138.0 MB/sec read_event/trim_text = false 1.00 199.3±3.59µs ? ?/sec 1.10 218.5±3.98µs ? ?/sec read_event/trim_text = true 1.00 190.4±3.76µs ? ?/sec 1.11 211.7±4.11µs ? ?/sec unescape_text/no_chars_to_unescape_short 1.00 11.8±0.21ns ? ?/sec 1.06 12.4±0.23ns ? ?/sec
When looking at the code I noticed that there was a break in
#[inline]
propagation on functions. Given that#[inline]
is non-transitive, cross crate inline optimization can be very easily disrupted if the chain breaks at the wrong depth.When running tests with this setup:
I noticed that although this function has
#[inline]
:The next funtion its calling does not:
In trying to see the effects of the break in the
#[inline]
propgation, I ran a few tests. I went through and at first added#[inline]
to the functions so that there is an end-to-end#[inline]
chain, and then later went through and changed it to be#[inline(always)]
to see what the differences would be.I ran
lto=true
as well to see if there is any benefit to#[inline]
or if it just ultimately matches what a manual#[inline]
decoration brings, looking to see if rather than adding the attribute, just give a heads up in the docs that lto would benefit this crate to some degree.The xml file was a 1.3GB file from the worksheet of this dataset.
Benchmarks
Percentage over
master
:Seeing that the changes even benefited
lto
, I think this is a worth while add.Profile
Current
master
:#[inline]
:#[inline(always)]
:#[inline(always)]
+lto
:One thing I did notice was a
memcpy
instruction that was 25% of the samples:I'm not sure if this is part of the program or if its just an artifact of the profiling.
The text was updated successfully, but these errors were encountered: