
Fsharp giraffe fortunes #3863

Merged
merged 33 commits into from
Jun 19, 2018

Conversation

dv00d00
Contributor

@dv00d00 dv00d00 commented Jun 15, 2018

Added the fortunes test via the default and a custom HTML renderer.
The ORM is Dapper.

Implemented a custom renderer that writes HTML into a memory stream.

Html renderers comparison

| Method   | Mean      | Error     | StdDev    | Scaled | Gen 0   | Allocated |
|----------|-----------|-----------|-----------|--------|---------|-----------|
| Standard | 105.23 us | 1.5500 us | 2.1729 us | 1.00   | 68.3594 | 105.15 KB |
| Custom   | 11.70 us  | 0.0697 us | 0.1044 us | 0.11   | 7.1411  | 10.99 KB  |

@dv00d00 dv00d00 changed the title Fsharp giraffe fortunes WIP Fsharp giraffe fortunes Jun 15, 2018
@dv00d00 dv00d00 changed the title WIP Fsharp giraffe fortunes RFC Fsharp giraffe fortunes Jun 15, 2018
@dv00d00
Contributor Author

dv00d00 commented Jun 15, 2018

would appreciate a review @benaadams @dustinmoris @isaacabraham @gerardtoconnor

"approach": "Realistic",
"classification": "fullstack",
"database": "None",
"approach": "Stripped",
Contributor

It's still Realistic and likely to become the de facto method once Utf8String becomes mainline (dotnet/corefxlab#2350).

Contributor

In a recent run it's 1,347,868.6 rps vs 1,220,305.2 rps for giraffe-utf8plaintext vs giraffe.

Contributor

Also, "stripped" won't show up in the results by default, so choose it carefully...

@@ -0,0 +1,20 @@
<Project Sdk="Microsoft.NET.Sdk">
Contributor

Benchmarks in the benchmarks? 😄

"notes": "",
"versus": "aspcore"
}
},
{
"utf8plaintext": {
"stripped": {
Contributor

utf8direct?

Contributor

@gerardtoconnor left a comment

Looks good for the most part. The biggest thing to look into is the use of a temporary MemoryStream for rendering; try writing to the response body stream instead if possible.

| ParentNode (e, nodes) -> do writeParentNode target e nodes
| VoidElement (n, attrs) -> do writeStartElement target n attrs

let renderHtmlToStream (encoding:Encoding) node =
Contributor

Why is a separate memory stream being created rather than using the ASP.NET Core buffers in the response? The MemoryStream allocates a resizing byte array internally, so writing into it only to later flush into the response is just waste. We should look to write directly into the response body stream; an example of a similar implementation to the above is ByteViewEngine, which outperforms most due to writing directly to the body stream. Is there an important reason a MemoryStream is being used that I have overlooked?

Contributor Author

Writing to the HTTP output would mean I need to use the task API for every write, which, assuming a recursive walk, would result in more allocations for tasks and binds than a memory stream, where I can write without the need for async in the first place.

The other bit is that I am not entirely sure about the consequences of writing to the HTTP output stream without knowing the content length upfront; I think Kestrel would fall back to chunked transfer encoding. @benaadams might give a clue here.
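The trade-off described here, rendering synchronously into a MemoryStream and then issuing a single async write, can be sketched roughly as follows (a hedged sketch, not the PR's actual code; `render` and `writeHtml` are hypothetical names, and a `task` computation expression such as Giraffe's is assumed to be in scope):

```fsharp
open System
open System.IO
open Microsoft.AspNetCore.Http

// Sketch: walk the template synchronously into a MemoryStream (no per-write
// task allocations or binds), then copy it to the response with one async
// write. Knowing the final length also lets us set Content-Length up front,
// avoiding the fallback to chunked transfer encoding.
let writeHtml (render : Stream -> unit) (ctx : HttpContext) =
    task {
        use ms = new MemoryStream()
        render ms                                       // synchronous recursive walk
        ctx.Response.ContentLength <- Nullable ms.Length
        ms.Position <- 0L
        do! ms.CopyToAsync ctx.Response.Body            // single async write
    }
```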

Contributor

This is why I provided the example of my ByteViewEngine: it does not use the async API and therefore makes only a small compromise on blocking. The writing of the small chunks will be a hot path into the Kestrel buffers with no blocking until the buffer fills and needs to flush; the buffer is big enough and the pages small enough that there will be minimal buffer flushes, and on the flushes, unlike a DB round trip or file read request, there is no waiting on the return, so minimal blocking. The whole operation can be a task, but it is performant on a single sync thread.

On the content length, I'm not sure about the upfront content-length omission; that is a little trickier but doable if needed.

We can pool & reuse the memory streams; .SetLength(0) will reset one once done. A more complex, concise but painful way: if we pre-compute & cache the template length with no variables at start, we can iterate the variables to calculate their additional size each time before writing out to the buffer. The double iteration won't add much more time, as the array (list) will already be CPU-cached from the prior run and was just loaded anyway from the DB call; importantly, no extra big MemoryStream array allocs.

Either way, upfront content-length setting is not simple, so if anyone has insight into chunked transfer encoding, it would save us wasting time on any complex workarounds.

Contributor

@benaadams commented Jun 17, 2018

The aspnetcore platform test does a straight write into the underlying Kestrel pipeline (bypassing most of the aspnetcore stack):
content-length => 294,934 rps, ~3.12Gb/s

The middleware test writes to a (cached) StringBuilder, then does a single async write to the body:
content-length => 238,135 rps

The mvc test goes via a controller, then renders via a view, so it is the full kitchen sink:
chunked => 105,687 rps

With h2o at 421,972 rps and doing interesting things with its database driver (which is being investigated in aspnet/DataAccessPerformance#42), a lot may be lost in the db driver currently.

I assume that even if Giraffe went chunked, it should be faster than mvc, as it's closer to middleware in terms of the layers of API stack involved?

Contributor

Thanks @benaadams, this is helpful. Fully agree: more often than not, the DB calls/driver are the bottleneck and can make these other micro-optimisations immaterial.

I think we should try to avoid the chunking in principle; (despite being a thin layer) it's a bit of a lazy cheat when pooling a MemoryStream isn't too difficult. I prefer the pooled MemoryStream over a StringBuilder given there is wasted CPU & memory in converting to a UTF-16 string when the ultimate output is UTF-8; best to write & store as UTF-8. The C# middleware test inefficiently has to iterate & convert the StringBuilder string (UTF-16) twice: first to count the bytes (UTF-8), and then to write the string (UTF-16) out as UTF-8.

The idea of pre-computing the base template size and then at runtime only calculating the incremental GetByteCount sum of the variables is probably a level of runtime complexity that will be slower than simply pooling buffers, so it can be ignored.

@dv00d00 do you think MemoryStream pooling is the way to go then? It means you can leave the current templating engine as is, just getting/returning a MemoryStream from the pool. It's probably reasonable to pre-allocate the pool on start up (50 MemoryStreams of size 1024 bytes?).

Contributor

> fully agree, more often than not, DB calls/driver are the bottleneck and can make these other micro optimisations immaterial.

aspnetcore can still output ~3.12Gb/s, which is 30% of network capacity; h2o is at ~4.46Gb/s, or 45%; assuming the DB transfer is similar, they are both at 60% - 90% of the network. Possibly for the next round that's bumping up to a 100Gb/s network, so that may all change.

Also, aspnetcore platform -> aspnetcore mvc on the same db driver is a 64% drop in perf; so it's probably worth tweaking that last little bit out 😉

Contributor

> Also aspnetcore platform -> aspnetcore mvc on the same db driver is a 64% drop in perf; so its probably worth tweaking that last little bit out 😉

That's what we did with Giraffe, MVC being the last little bit out 😉.

With Giraffe we will continue to micro-optimise, don't worry; in addition to utilizing all the biggest wins in aspnetcore, it's the best of both worlds for us: you're doing all the heavy lifting!

Contributor

Results for the current (merged) version of Giraffe are available at https://twitter.com/ben_a_adams/status/1008406784212656130


let view =
data
|> Seq.append extra'
Contributor

Using Seq module operations is inefficient, as they involve a lot of temporary structures; the Custom implementation of adding the extra row is better (using lists). Why is it not used here? Is it to be the most "intuitive" F# implementation?

Contributor Author

This is probably how I would write it without thinking of perf; will replace it with the list version.
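For illustration, a hedged sketch of the list-based alternative being discussed (the `Fortune` shape and `withExtra` name are assumptions, not the PR's code):

```fsharp
// Sketch: cons the extra fortune onto the head of the list instead of using
// Seq.append, avoiding the Seq module's enumerator and wrapper allocations.
type Fortune = { id : int; message : string }

let withExtra (data : Fortune list) =
    let extra = { id = 0; message = "Additional fortune added at request time." }
    extra :: data
    |> List.sortBy (fun f -> f.message)
```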

return! ctx.WriteBytesAsync bytes
}

GET >=> choose [
Contributor

The "GET" method filter is more correct, but adding unneeded bloat to the pipeline when it is not a test condition is just wasted resource; I would recommend leaving the method filter out to squeeze out a tiny bit more perf.

Contributor Author

Thanks for the spot, it should go.

GET >=> choose [
route "/plaintext" >=> text "Hello, World!"
route "/json" >=> json { JsonMessage.message = "Hello, World!" }
route "/fortunes" >=> fortunes
Contributor

The standard router tests each route one by one until it finds a match, so this is inefficient for routes later in the list like "/fortunes". It would be better to use TokenRouter (which needs to be pushed out of beta to be used, I believe), or, as this is custom, to create a custom routing function with a match/Dictionary on the path, as no complicated parsing is required.
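A hedged sketch of the custom match-on-path routing being suggested (handler names are hypothetical; Giraffe's `HttpFunc`/`HttpContext` types are assumed):

```fsharp
open Microsoft.AspNetCore.Http

// Sketch: a direct pattern match on the request path dispatches in one step,
// instead of testing each route handler in turn as the standard router does.
// plainTextHandler, jsonHandler and fortunes are placeholder handler names.
let router (next : HttpFunc) (ctx : HttpContext) =
    match ctx.Request.Path.Value with
    | "/plaintext" -> plainTextHandler next ctx
    | "/json"      -> jsonHandler next ctx
    | "/fortunes"  -> fortunes next ctx
    | _            -> next ctx
```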

Contributor Author

Will definitely take a look into it, thanks.

ctx.Response.StatusCode <- 200
task {
do! ctx.Response.Body.WriteAsync(bytes, 0, bytes.Length)
return none
Contributor

As mentioned above ^^: I am guessing you are not using the "text" helper function to allow caching of known string bytes? Although "none" can be returned and will work, this is generally wrong usage, as the "choose" functions will continue to try to match routes until they get "Some ctx"; therefore, to prevent testing of further routes, it is best to return "Some ctx".

ctx.Response.StatusCode <- 200
task {
do! ctx.Response.Body.WriteAsync(bytes, 0, bytes.Length)
return none
Contributor

I am guessing you are not using the "json" helper function to allow caching of known serialized data bytes? Although "none" can be returned and will work, this is generally wrong usage, as the "choose" functions will continue to try to match routes until they get "Some ctx"; therefore, to prevent testing of further routes, it is best to return "Some ctx".
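The `Some ctx` fix both comments describe could look roughly like this (a hedged sketch assuming Giraffe's handler types and a `task` computation expression in scope; `writeBytes` is a hypothetical name and `bytes` stands for the pre-encoded payload):

```fsharp
open Microsoft.AspNetCore.Http

// Sketch: return Some ctx so that "choose" treats the request as handled and
// stops testing further routes; returning None would make choose keep trying
// alternatives even though the response body has already been written.
let writeBytes (bytes : byte[]) : HttpHandler =
    fun (_ : HttpFunc) (ctx : HttpContext) ->
        task {
            ctx.Response.StatusCode <- 200
            do! ctx.Response.Body.WriteAsync(bytes, 0, bytes.Length)
            return Some ctx
        }
```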

fun (_ : HttpFunc) (ctx : HttpContext) ->

let conn = new NpgsqlConnection(ConnectionString)
ctx.Response.RegisterForDispose conn
Contributor

Out of curiosity, why is this used rather than "task { use conn = new … }" for disposal?

Contributor Author

My reasoning was that the task builder will create a closure for every bind internally, so I am trying to reduce this as much as possible.


match attributes with
| [||] ->
do target
Contributor

Why are there so many explicit "do" statements used in this module? Is there an error if they are not explicitly included? Usually "do" is only required for iteration body statements and constructor initialization code.

@gerardtoconnor gerardtoconnor mentioned this pull request Jun 16, 2018
Dmytro Kushnir added 2 commits June 17, 2018 22:49
added posibility to pass output stream to the html renderer
simplified routing for stock and custom cases
@dv00d00
Contributor Author

dv00d00 commented Jun 17, 2018

Thanks for the review guys, will continue to work on it during the next week.
I am a bit interested in memory stream pooling as well.

I've benchmarked the pooled impl as in https://www.nuget.org/packages/Microsoft.IO.RecyclableMemoryStream/

The results are:

| Method             | Mean      | Error     | StdDev    | Scaled | Gen 0  | Allocated |
|--------------------|-----------|-----------|-----------|--------|--------|-----------|
| NewMemoryStream    | 9.211 us  | 0.0491 us | 0.0459 us | 1.00   | 3.5858 | 11.06 KB  |
| PooledMemoryStream | 11.287 us | 0.0776 us | 0.0725 us | 1.23   | 2.5177 | 7.78 KB   |

@dv00d00 dv00d00 changed the title RFC Fsharp giraffe fortunes WIP Fsharp giraffe fortunes Jun 17, 2018
@benaadams
Contributor

Could use a ThreadStatic cache with a rent and return? (That's how I'd do it in C#.) The RecyclableMemoryStream is more for large content streams, whereas this data never gets big.
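A hedged sketch of such a ThreadStatic rent/return cache in F# (the `StreamCache` name and sizes are hypothetical, not the PR's ported code):

```fsharp
open System
open System.IO

// Sketch: each thread caches a single MemoryStream, so renting and returning
// needs no locks; this suits frequent, small, short-lived buffers.
type StreamCache() =
    [<ThreadStatic; DefaultValue>]
    static val mutable private cached : MemoryStream

    static member Rent () =
        match StreamCache.cached with
        | null -> new MemoryStream(1024)        // nothing cached on this thread yet
        | ms ->
            StreamCache.cached <- null
            ms.SetLength 0L                     // reset length (and position) for reuse
            ms

    static member Return (ms : MemoryStream) =
        StreamCache.cached <- ms
```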

@gerardtoconnor
Contributor

Updates look good.

Thanks for doing the initial benchmark. I'm not too surprised; there is usually a large tipping point where these pools pay off. The heavy thread locking/syncing on this one is a bottleneck, so Ben's suggestion is good: we need frequent single small buffers per thread rather than massive irregular ones.

It seems the standard BenchmarkDotNet run will have fewer allocs & less GC than TechEmpower; we can also tweak the pool configuration for block size etc. When the full aspnetcore stack is running with its memory footprint, there is less free to alloc before GC, so in micro benchmarks like this only minimal GC is needed; it will be higher in a full application test.

I actually remembered, one of my test view engines might work better in this case, TemplateViewEngine, as I could tweak it to do my original complex suggestion of counting the pre-compiled elements and using a GetByteCount summation on the model, adding it to the pre-computed template length. Bit of a radical one, but just throwing it out there: all static elements of the template are built into a rendering tree of byte arrays, with the model rendered into the gaps, writing direct to the body.

@dv00d00
Contributor Author

dv00d00 commented Jun 18, 2018

Ported @benaadams cache
Results:

| Method          | Mean     | Error     | StdDev    | Median   | Scaled | ScaledSD | Gen 0  | Allocated |
|-----------------|----------|-----------|-----------|----------|--------|----------|--------|-----------|
| NewMemoryStream | 11.91 us | 0.2376 us | 0.4404 us | 11.69 us | 1.00   | 0.00     | 7.0038 | 10.78 KB  |
| CustomPool      | 11.54 us | 0.2295 us | 0.4366 us | 11.49 us | 0.97   | 0.05     | 4.7760 | 7.34 KB   |
| MSPool          | 14.55 us | 0.2907 us | 0.8295 us | 14.26 us | 1.22   | 0.08     | 5.0507 | 7.78 KB   |

win win, will adopt it

@dv00d00
Contributor Author

dv00d00 commented Jun 18, 2018

@gerardtoconnor I would really like to try out your TemplateViewEngine. Is it published on NuGet?

not sure if we really need stock samples for plaintext and json
@dv00d00 dv00d00 changed the title WIP Fsharp giraffe fortunes Fsharp giraffe fortunes Jun 18, 2018
@dv00d00 dv00d00 changed the title Fsharp giraffe fortunes WIP Fsharp giraffe fortunes Jun 18, 2018
@dv00d00 dv00d00 changed the title WIP Fsharp giraffe fortunes Fsharp giraffe fortunes Jun 18, 2018
@dv00d00
Contributor Author

dv00d00 commented Jun 18, 2018

I think it is ready.
Would appreciate a review by the TechEmpower team.

Also, I'm not entirely sure about the "stripped" approach for the custom fortunes test; would the "micro" classification be enough to justify custom HTML rendering? The rest of the code does not feel like something too wild?

@benaadams
Contributor

> Would appreciate a review by TechEmpower team.
>
> Also not entirely sure about "stripped" approach for custom fortunes test, would "micro" classification be enough to justify custom html rendering? Rest of the code does not feel like something too wild?

/cc @msmith-techempower @nbrady-techempower

@NateBrady23
Member

@dv00d00 I think "stripped" approach makes sense here. Micro classification is saying you're providing just enough middleware or whatever to make the implementation work. Stripped approach is saying you're writing a lot more custom code specific for this use case, either because the framework doesn't provide the utilities or you're testing out different methods closer to the metal that you might want to incorporate into the framework. What you have now looks right to me. @msmith-techempower anything to add?

@gerardtoconnor
Contributor

@dv00d00 I have not got that TemplateViewEngine on NuGet; it was only a POC. Without the pre-counting of the model bytes to set the content length beforehand, as is, it'll chunk, so it's not much use; I will have a look at tweaking it to be able to count & set the content length.

For now I think your adapted MS pool from Ben's design is a great solution.

"platform": ".NET",
"flavor": "CoreCLR",
"webserver": "Kestrel",
"os": "Linux",
"database_os": "Linux",
"display_name": "Giraffe",
"display_name": "Giraffe, Dapper",
Member

Not a big fan of this, but since we don't have a key for the name of your database approach/library, I guess it will do.

@msmith-techempower msmith-techempower merged commit 047245a into TechEmpower:master Jun 19, 2018
roberthusak pushed a commit to roberthusak/FrameworkBenchmarks that referenced this pull request Nov 6, 2018