Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: add container/queue #27935

Open
christianrpetrin opened this issue Sep 29, 2018 · 95 comments
Open

proposal: add container/queue #27935

christianrpetrin opened this issue Sep 29, 2018 · 95 comments

Comments

@christianrpetrin
Copy link

christianrpetrin commented Sep 29, 2018

Proposal: Built in support for high performance unbounded queue

Author: Christian Petrin.

Last updated: November 26, 2018

Discussion at: #27935

Design document at https://github.com/golang/proposal/blob/master/design/27935-unbounded-queue-package.md

Abstract

I propose to add a new package, "container/queue", to the standard library to support an in-memory, unbounded, general purpose queue implementation.

Queues in computer science is a very old, well established and well known concept, yet Go doesn't provide a specialized, safe to use, performant and issue free unbounded queue implementation.

Buffered channels provide an excellent option to be used as a queue, but buffered channels are bounded and so doesn't scale to support very large data sets. The same applies for the standard ring package.

The standard list package can be used as the underlying data structure for building unbounded queues, but the performance yielded by this linked list based implementation is not optimal.

Implementing a queue using slices as suggested here is a feasible approach, but the performance yielded by this implementation can be abysmal in some high load scenarios.

Background

Queues that grows dynamically has many uses. As an example, I'm working on a logging system called CloudLogger that sends all logged data to external logging management systems, such as Stackdriver and Cloudwatch. External logging systems typically rate limit how much data their service will accept for a given account and time frame. So in a scenario where the hosting application is logging more data than the logging management system will accept at a given moment, CloudLogger has to queue the extra logs and send them to the logging management system at a pace the system will accept. As there's no telling how much data will have to be queued as it depends on the current traffic, an unbounded, dynamically growing queue is the ideal data structure to be used. Buffered channels in this scenario is not ideal as they have a limit on how much data they will accept, and once that limit has been reached, the producers (routines adding to the channel) start to block, making the adding to the channel operation an "eventual" synchronous process. A fully asynchronous operation in this scenario is highly desirable as logging data should not slow down significantly the hosting application.

Above problem is a problem that, potentially, every system that calls another system faces. And in the cloud and microservices era, this is an extremely common scenario.

Due to the lack of support for built in unbounded queues in Go, Go engineers are left to either:

  1. Research and use external packages, or
  2. Build their own queue implementation

Both approaches are riddled with pitfalls.

Using external packages, especially in enterprise level software, requires a lot of care as using external, potentially untested and hard to understand code can have unwanted consequences. This problem is made much worse by the fact that, currently, there's no well established and disseminated open source Go queue implementation according to this stackoverflow discussion, this github search for Go queues and Awesome Go.

Building a queue, on the other hand, might sound like a compelling argument, but building efficient, high performant, bug free unbounded queue is a hard job that requires a pretty solid computer science foundation as well a good deal of time to research different design approaches, test different implementations, make sure the code is bug and memory leak free, etc.

In the end what Go engineers have been doing up to this point is building their own queues, which are for the most part inefficient and can have disastrous, yet hidden performance and memory issues. As examples of poorly designed and/or implemented queues, the approaches suggested here and here (among many others), requires linear copy of the internal slice for resizing purposes. Some implementations also has memory issues such as an ever expanding internal slice and memory leaks.

Proposal

I propose to add a new package, "container/queue", to the standard library to support in-memory unbounded queues. The proposed queue implementation offers excellent performance and very low memory consumption when comparing it to three promising open source implementations (gammazero, phf and juju); to use Go channels as queue; the standard list package as a queue as well as six other experimental queue implementations.

The proposed queue implementation offers the most balanced approach to performance given different loads, being significantly faster and still uses less memory than every other queue implementation in the tests.

The closest data structure Go has to offer for building dynamically growing queues for large data sets is the standard list package. When comparing the proposed solution to using the list package as an unbounded queue (refer to "BenchmarkList"), the proposed solution is consistently faster than using the list package as a queue as well as displaying a much lower memory footprint.

Reasoning

There's two well accepted approaches to implementing queues when in comes to the queue underlying data structure:

  1. Using linked list
  2. Using array

Linked list as the underlying data structure for an unbounded queue has the advantage of scaling efficiently when the underlying data structure needs to grow to accommodate more values. This is due to the fact that the existing elements doesn't need to be repositioned or copied around when the queue needs to grow.

However, there's a few concerns with this approach:

  1. The use of prev/next pointers for each value requires a good deal of extra memory
  2. Due to the fact that each "node" in the linked list can be allocated far away from the previous one, navigating through the list can be slow due to its bad memory locality properties
  3. Adding new values always require new memory allocations and pointers being set, hindering
    performance

On the other hand, using a slice as the underlying data structure for unbounded queues has the advantage of very good memory locality, making retrieval of values faster when comparing to linked lists. Also an "alloc more than needed right now" approach can easily be implemented with slices.

However, when the slice needs to expand to accommodate new values, a well adopted strategy is to allocate a new, larger slice, copy over all elements from the previous slice into the new one and use the new one to add the new elements.

The problem with this approach is the obvious need to copy all the values from the older, small slice, into the new one, yielding a poor performance when the amount of values that need copying are fairly large.

Another potential problem is a theoretical lower limit on how much data they can hold as slices, like arrays, have to allocate its specified positions in sequential memory addresses, so the maximum number of items the queue would ever be able to hold is the maximum size a slice can be allocated on that particular system at any given moment. Due to modern memory management techniques such as virtual memory and paging, this is a very hard scenario to corroborate thru practical testing.

Nonetheless, this approach doesn't scale well with large data sets.

Having said that, there's a third, newer approach to implementing unbounded queues: use fixed size linked slices as the underlying data structure.

The fixed size linked slices approach is a hybrid between the first two, providing good memory locality arrays have alongside the efficient growing mechanism linked lists offer. It is also not limited on the maximum size a slice can be allocated, being able to hold and deal efficiently with a theoretical much larger amount of data than pure slice based implementations.

Rationale

Research

A first implementation of the new design was built.

The benchmark tests showed the new design was very promising, so I decided to research about other possible queue designs and implementations with the goal to improve the first design and implementation.

As part of the research to identify the best possible queue designs and implementations, I implemented and probed a total of 7 experimental queue implementations. Below are a few of the most interesting ones.

  • queueimpl1: custom queue implementation that stores the values in a simple slice. Pop removes the first slice element. This is a slice based implementation that tests this suggestion.
  • queueimpl5: custom queue implementation that stores the values in linked slices. This implementation tests the queue performance when storing the "next" pointer as part of the values slice instead of having it as a separate "next" field. The next element is stored in the last position of the internal slices, which is a reserved position.
  • queueimpl7: custom queue implementation that stores the values in linked slices. This implementation tests the queue performance when performing lazy creation of the internal slice as well as starting with a 1-sized slice, allowing it to grow up to 16 by using the built in append function. Subsequent slices are created with 128 fixed size.

Also as part of the research, I investigated and probed below open source queue implementations as well.

  • phf: this is a slice, ring based queue implementation. Interesting to note the author did a pretty good job researching and probing other queue implementations as well.
  • gammazero: the deque implemented in this package is also a slice, ring based queue implementation.
  • juju: the deque implemented in this package uses a linked list based approach, similarly to other experimental implementations in this package such as queueimpl3. The biggest difference between this implementation and the other experimental ones is the fact that this queue uses the standard list package as the linked list. The standard list package implements a doubly linked list, while the experimental implementations implements their own singly linked list.

The standard list package as well as buffered channels were probed as well.

Benchmark Results

Add and remove 100 items
Performance

ns/op

Memory

B/op

Add and remove 100k items
Performance

ns/op

Memory

B/op

Aggregated Results
Performance

ns/op

Memory

B/op

Detailed, curated results can be found here

Aggregated, curated results can be found here

Given above results, queueimpl7, henceforth just "impl7", proved to be the most balanced implementation, being either faster or very competitive in all test scenarios from a performance and memory perspective.

Refer here for more details about the tests.

The benchmark tests can be found here.

Impl7 Design and Implementation

Impl7 was the result of the observation that some slice based implementations such as queueimpl1 and
queueimpl2 offers phenomenal performance when the queue is used with small data sets.

For instance, comparing queueimpl3 (very simple linked slice implementation) with queueimpl1 (very simple slice based implementation), the results at adding 0 (init time only), 1 and 10 items are very favorable for impl1, from a performance and memory perspective.

benchstat rawresults/bench-impl1.txt rawresults/bench-impl3.txt
name       old time/op    new time/op    delta
/0-4         6.83ns ± 3%  472.53ns ± 7%   +6821.99%  (p=0.000 n=20+17)
/1-4         48.1ns ± 6%   492.4ns ± 5%    +924.66%  (p=0.000 n=20+20)
/10-4         532ns ± 5%     695ns ± 8%     +30.57%  (p=0.000 n=20+20)
/100-4       3.19µs ± 2%    2.50µs ± 4%     -21.69%  (p=0.000 n=18+19)
/1000-4      24.5µs ± 3%    23.6µs ± 2%      -3.33%  (p=0.000 n=19+19)
/10000-4      322µs ± 4%     238µs ± 1%     -26.02%  (p=0.000 n=19+18)
/100000-4    15.8ms ±10%     3.3ms ±13%     -79.32%  (p=0.000 n=20+20)

name       old alloc/op   new alloc/op   delta
/0-4          0.00B       2080.00B ± 0%       +Inf%  (p=0.000 n=20+20)
/1-4          16.0B ± 0%   2080.0B ± 0%  +12900.00%  (p=0.000 n=20+20)
/10-4          568B ± 0%     2152B ± 0%    +278.87%  (p=0.000 n=20+20)
/100-4       4.36kB ± 0%    2.87kB ± 0%     -34.13%  (p=0.000 n=20+20)
/1000-4      40.7kB ± 0%    24.6kB ± 0%     -39.54%  (p=0.000 n=20+20)
/10000-4      746kB ± 0%     244kB ± 0%     -67.27%  (p=0.000 n=20+20)
/100000-4    10.0MB ± 0%     2.4MB ± 0%     -75.85%  (p=0.000 n=15+20)

name       old allocs/op  new allocs/op  delta
/0-4           0.00           2.00 ± 0%       +Inf%  (p=0.000 n=20+20)
/1-4           1.00 ± 0%      2.00 ± 0%    +100.00%  (p=0.000 n=20+20)
/10-4          14.0 ± 0%      11.0 ± 0%     -21.43%  (p=0.000 n=20+20)
/100-4          108 ± 0%       101 ± 0%      -6.48%  (p=0.000 n=20+20)
/1000-4       1.01k ± 0%     1.01k ± 0%      +0.50%  (p=0.000 n=20+20)
/10000-4      10.0k ± 0%     10.2k ± 0%      +1.35%  (p=0.000 n=20+20)
/100000-4      100k ± 0%      102k ± 0%      +1.53%  (p=0.000 n=20+20)

Impl7 is a hybrid experiment between using a simple slice based queue implementation for small data sets and the fixed size linked slice approach for large data sets, which is an approach that scales really well, offering really good performance for small and large data sets.

The implementation starts by lazily creating the first slice to hold the first values added to the queue.

const (
    // firstSliceSize holds the size of the first slice.
    firstSliceSize = 1

    // maxFirstSliceSize holds the maximum size of the first slice.
    maxFirstSliceSize = 16

    // maxInternalSliceSize holds the maximum size of each internal slice.
    maxInternalSliceSize = 128
)

...

// Push adds a value to the queue.
// The complexity is amortized O(1).
func (q *Queueimpl7) Push(v interface{}) {
    if q.head == nil {
        h := newNode(firstSliceSize) // Returns a 1-sized slice.
        q.head = h
        q.tail = h
        q.lastSliceSize = maxFirstSliceSize
    } else if len(q.tail.v) >= q.lastSliceSize {
        n := newNode(maxInternalSliceSize) // Returns a 128-sized slice.
        q.tail.n = n
        q.tail = n
        q.lastSliceSize = maxInternalSliceSize
    }

    q.tail.v = append(q.tail.v, v)
    q.len++
}

...

// newNode returns an initialized node.
func newNode(capacity int) *Node {
    return &Node{
        v: make([]interface{}, 0, capacity),
    }
}

The very first created slice is created with capacity 1. The implementation allows the builtin append function to dynamically resize the slice up to 16 (maxFirstSliceSize) positions. After that it reverts to creating fixed size 128 position slices, which offers the best performance for data sets above 16 items.

16 items was chosen as this seems to provide the best balanced performance for small and large data sets according to the array size benchmark tests. Above 16 items, growing the slice means allocating a new, larger one and copying all 16 elements from the previous slice into the new one. The append function phenomenal performance can only compensate for the added copying of elements if the data set is very small, no more than 8 items in the benchmark tests. For above 8 items, the fixed size slice approach is consistently faster and uses less memory, where 128 sized slices are allocated and linked together when the data structure needs to scale to accommodate new values.

Why 16? Why not 15 or 14?

The builtin append function, as of "go1.11 darwin/amd64", seems to double the slice size every time it needs to allocate a new one.

ts := make([]int, 0, 1)

ts = append(ts, 1)
fmt.Println(cap(ts)) // Slice has 1 item; output: 1

ts = append(ts, 1)
fmt.Println(cap(ts)) // Slice has 2 items; output: 2

ts = append(ts, 1)
fmt.Println(cap(ts)) // Slice has 3 items; output: 4

ts = append(ts, 1)
ts = append(ts, 1)
fmt.Println(cap(ts)) // Slice has 5 items; output: 8

ts = append(ts, 1)
ts = append(ts, 1)
ts = append(ts, 1)
ts = append(ts, 1)
fmt.Println(cap(ts)) // Slice has 9 items; output: 16

Since the append function will resize the slice from 8 to 16 positions, it makes sense to use all 16 already allocated positions before switching to the fixed size slices approach.

Design Considerations

Impl7 uses linked slices as its underlying data structure.

The reason for the choice comes from two main observations of slice based queues:

  1. When the queue needs to expand to accommodate new values, a new, larger slice needs to be allocated and used
  2. Allocating and managing large slices is expensive, especially in an overloaded system with little available physical memory

To help clarify the scenario, below is what happens when a slice based queue that already holds, say 1bi items, needs to expand to accommodate a new item.

Slice based implementation

  • Allocate a new, twice the size of the previous allocated one, say 2 billion positions slice
  • Copy over all 1 billion items from the previous slice into the new one
  • Add the new value into the first unused position in the new slice, position 1000000001.

The same scenario for impl7 plays out like below.

Impl7

  • Allocate a new 128 size slice
  • Set the next pointer
  • Add the value into the first position of the new slice, position 0

Impl7 never copies data around, but slice based ones do, and if the data set is large, it doesn't matter how fast the copying algorithm is. The copying has to be done and will take some time.

The decision to use linked slices was also the result of the observation that slices goes to great length to provide predictive, indexed positions. A hash table, for instance, absolutely need this property, but not a queue. So impl7 completely gives up this property and focus on what really matters: add to end, retrieve from head. No copying around and repositioning of elements is needed for that. So when a slice goes to great length to provide that functionality, the whole work of allocating new arrays, copying data around is all wasted work. None of that is necessary. And this work costs dearly for large data sets as observed in the tests.

Impl7 Benchmark Results

Below compares impl7 with a few selected implementations.

The tests name are formatted given below.

  • Benchmark/N-4: benchmark a queue implementation where N denotes the number of items added and removed to/from the queue; 4 means the number of CPU cores in the host machine.

Examples:

  • Benchmark/0-4: benchmark the queue by creating a new instance of it. This only test initialization time.
  • Benchmark/100-4: benchmark the queue by creating a new instance of it and adding and removing 100 items to/from the queue.

Standard list used as a FIFO queue vs impl7.

benchstat rawresults/bench-list.txt rawresults/bench-impl7.txt
name       old time/op    new time/op    delta
/0-4         34.9ns ± 1%     1.2ns ± 3%   -96.64%  (p=0.000 n=19+20)
/1-4         77.0ns ± 1%    68.3ns ± 1%   -11.21%  (p=0.000 n=20+20)
/10-4         574ns ± 0%     578ns ± 0%    +0.59%  (p=0.000 n=18+20)
/100-4       5.94µs ± 1%    3.07µs ± 0%   -48.28%  (p=0.000 n=19+18)
/1000-4      56.0µs ± 1%    25.8µs ± 1%   -53.92%  (p=0.000 n=20+20)
/10000-4      618µs ± 1%     260µs ± 1%   -57.99%  (p=0.000 n=20+18)
/100000-4    13.1ms ± 6%     3.1ms ± 3%   -76.50%  (p=0.000 n=20+20)

name       old alloc/op   new alloc/op   delta
/0-4          48.0B ± 0%      0.0B       -100.00%  (p=0.000 n=20+20)
/1-4          96.0B ± 0%     48.0B ± 0%   -50.00%  (p=0.000 n=20+20)
/10-4          600B ± 0%      600B ± 0%      ~     (all equal)
/100-4       5.64kB ± 0%    3.40kB ± 0%   -39.72%  (p=0.000 n=20+20)
/1000-4      56.0kB ± 0%    25.2kB ± 0%   -55.10%  (p=0.000 n=20+20)
/10000-4      560kB ± 0%     243kB ± 0%   -56.65%  (p=0.000 n=20+20)
/100000-4    5.60MB ± 0%    2.43MB ± 0%   -56.66%  (p=0.000 n=18+20)

name       old allocs/op  new allocs/op  delta
/0-4           1.00 ± 0%      0.00       -100.00%  (p=0.000 n=20+20)
/1-4           2.00 ± 0%      2.00 ± 0%      ~     (all equal)
/10-4          20.0 ± 0%      15.0 ± 0%   -25.00%  (p=0.000 n=20+20)
/100-4          200 ± 0%       107 ± 0%   -46.50%  (p=0.000 n=20+20)
/1000-4       2.00k ± 0%     1.02k ± 0%   -48.95%  (p=0.000 n=20+20)
/10000-4      20.0k ± 0%     10.2k ± 0%   -49.20%  (p=0.000 n=20+20)
/100000-4      200k ± 0%      102k ± 0%   -49.22%  (p=0.000 n=20+20)

Impl7 is:

  • Up to ~29x faster (1.2ns vs 34.9ns) than list package for init time (0 items)
  • Up to ~4x faster (3.1ms vs 13.1ms) than list package for 100k items
  • Uses ~1/2 memory (2.43MB vs 5.60MB) than list package for 100k items

impl1 (simple slice based queue implementaion) vs impl7.

benchstat rawresults/bench-impl1.txt rawresults/bench-impl7.txt
name       old time/op    new time/op    delta
/0-4         6.83ns ± 3%    1.18ns ± 3%   -82.79%  (p=0.000 n=20+20)
/1-4         48.1ns ± 6%    68.3ns ± 1%   +42.23%  (p=0.000 n=20+20)
/10-4         532ns ± 5%     578ns ± 0%    +8.55%  (p=0.000 n=20+20)
/100-4       3.19µs ± 2%    3.07µs ± 0%    -3.74%  (p=0.000 n=18+18)
/1000-4      24.5µs ± 3%    25.8µs ± 1%    +5.51%  (p=0.000 n=19+20)
/10000-4      322µs ± 4%     260µs ± 1%   -19.23%  (p=0.000 n=19+18)
/100000-4    15.8ms ±10%     3.1ms ± 3%   -80.60%  (p=0.000 n=20+20)

name       old alloc/op   new alloc/op   delta
/0-4          0.00B          0.00B           ~     (all equal)
/1-4          16.0B ± 0%     48.0B ± 0%  +200.00%  (p=0.000 n=20+20)
/10-4          568B ± 0%      600B ± 0%    +5.63%  (p=0.000 n=20+20)
/100-4       4.36kB ± 0%    3.40kB ± 0%   -22.02%  (p=0.000 n=20+20)
/1000-4      40.7kB ± 0%    25.2kB ± 0%   -38.25%  (p=0.000 n=20+20)
/10000-4      746kB ± 0%     243kB ± 0%   -67.47%  (p=0.000 n=20+20)
/100000-4    10.0MB ± 0%     2.4MB ± 0%   -75.84%  (p=0.000 n=15+20)

name       old allocs/op  new allocs/op  delta
/0-4           0.00           0.00           ~     (all equal)
/1-4           1.00 ± 0%      2.00 ± 0%  +100.00%  (p=0.000 n=20+20)
/10-4          14.0 ± 0%      15.0 ± 0%    +7.14%  (p=0.000 n=20+20)
/100-4          108 ± 0%       107 ± 0%    -0.93%  (p=0.000 n=20+20)
/1000-4       1.01k ± 0%     1.02k ± 0%    +1.09%  (p=0.000 n=20+20)
/10000-4      10.0k ± 0%     10.2k ± 0%    +1.39%  (p=0.000 n=20+20)
/100000-4      100k ± 0%      102k ± 0%    +1.54%  (p=0.000 n=20+20)

Impl7 is:

  • Up to ~5x faster (1.18ns vs 6.83ns) than impl1 for init time (0 items)
  • Up to ~5x faster (3.1ms vs 15.8ms) than impl1 for 100k items
  • Uses ~1/4 memory (2.4MB vs 10MB) than impl1 for 100k items

It's important to note that the performance and memory gains for impl7 is exponential like the larger the data set is due to the fact slice based implementations doesn't scale well, paying a higher and higher price, performance and memory wise, every time it needs to scale to accommodate an ever
expanding data set.


phf (slice, ring based FIFO queue implementation) vs impl7.

benchstat rawresults/bench-phf.txt rawresults/bench-impl7.txt
name       old time/op    new time/op    delta
/0-4         28.1ns ± 1%     1.2ns ± 3%   -95.83%  (p=0.000 n=20+20)
/1-4         42.5ns ± 1%    68.3ns ± 1%   +60.80%  (p=0.000 n=20+20)
/10-4         681ns ± 1%     578ns ± 0%   -15.11%  (p=0.000 n=18+20)
/100-4       4.55µs ± 1%    3.07µs ± 0%   -32.45%  (p=0.000 n=19+18)
/1000-4      35.5µs ± 1%    25.8µs ± 1%   -27.32%  (p=0.000 n=18+20)
/10000-4      349µs ± 2%     260µs ± 1%   -25.67%  (p=0.000 n=20+18)
/100000-4    11.7ms ±11%     3.1ms ± 3%   -73.77%  (p=0.000 n=20+20)

name       old alloc/op   new alloc/op   delta
/0-4          16.0B ± 0%      0.0B       -100.00%  (p=0.000 n=20+20)
/1-4          16.0B ± 0%     48.0B ± 0%  +200.00%  (p=0.000 n=20+20)
/10-4          696B ± 0%      600B ± 0%   -13.79%  (p=0.000 n=20+20)
/100-4       6.79kB ± 0%    3.40kB ± 0%   -49.94%  (p=0.000 n=20+20)
/1000-4      57.0kB ± 0%    25.2kB ± 0%   -55.86%  (p=0.000 n=20+20)
/10000-4      473kB ± 0%     243kB ± 0%   -48.68%  (p=0.000 n=20+20)
/100000-4    7.09MB ± 0%    2.43MB ± 0%   -65.77%  (p=0.000 n=18+20)

name       old allocs/op  new allocs/op  delta
/0-4           1.00 ± 0%      0.00       -100.00%  (p=0.000 n=20+20)
/1-4           1.00 ± 0%      2.00 ± 0%  +100.00%  (p=0.000 n=20+20)
/10-4          15.0 ± 0%      15.0 ± 0%      ~     (all equal)
/100-4          111 ± 0%       107 ± 0%    -3.60%  (p=0.000 n=20+20)
/1000-4       1.02k ± 0%     1.02k ± 0%    +0.39%  (p=0.000 n=20+20)
/10000-4      10.0k ± 0%     10.2k ± 0%    +1.38%  (p=0.000 n=20+20)
/100000-4      100k ± 0%      102k ± 0%    +1.54%  (p=0.000 n=20+20)

Impl7 is:

  • Up to ~23x faster (1.2ns vs 28.1ns) than phf for init time (0 items)
  • Up to ~3x faster (3.1ms vs 11.7ms) than phf for 100k items
  • Uses ~1/2 memory (2.43MB vs 7.09MB) than phf for 100k items

Buffered channel vs impl7.

benchstat rawresults/bench-channel.txt rawresults/bench-impl7.txt
name       old time/op    new time/op    delta
/0-4         30.2ns ± 1%     1.2ns ± 3%   -96.12%  (p=0.000 n=19+20)
/1-4         87.6ns ± 1%    68.3ns ± 1%   -22.00%  (p=0.000 n=19+20)
/10-4         704ns ± 1%     578ns ± 0%   -17.90%  (p=0.000 n=20+20)
/100-4       6.78µs ± 1%    3.07µs ± 0%   -54.70%  (p=0.000 n=20+18)
/1000-4      67.3µs ± 1%    25.8µs ± 1%   -61.65%  (p=0.000 n=20+20)
/10000-4      672µs ± 1%     260µs ± 1%   -61.36%  (p=0.000 n=19+18)
/100000-4    6.76ms ± 1%    3.07ms ± 3%   -54.61%  (p=0.000 n=19+20)

name       old alloc/op   new alloc/op   delta
/0-4          96.0B ± 0%      0.0B       -100.00%  (p=0.000 n=20+20)
/1-4           112B ± 0%       48B ± 0%   -57.14%  (p=0.000 n=20+20)
/10-4          248B ± 0%      600B ± 0%  +141.94%  (p=0.000 n=20+20)
/100-4       1.69kB ± 0%    3.40kB ± 0%  +101.42%  (p=0.000 n=20+20)
/1000-4      16.2kB ± 0%    25.2kB ± 0%   +55.46%  (p=0.000 n=20+20)
/10000-4      162kB ± 0%     243kB ± 0%   +49.93%  (p=0.000 n=20+20)
/100000-4    1.60MB ± 0%    2.43MB ± 0%   +51.43%  (p=0.000 n=16+20)

name       old allocs/op  new allocs/op  delta
/0-4           1.00 ± 0%      0.00       -100.00%  (p=0.000 n=20+20)
/1-4           1.00 ± 0%      2.00 ± 0%  +100.00%  (p=0.000 n=20+20)
/10-4          10.0 ± 0%      15.0 ± 0%   +50.00%  (p=0.000 n=20+20)
/100-4          100 ± 0%       107 ± 0%    +7.00%  (p=0.000 n=20+20)
/1000-4       1.00k ± 0%     1.02k ± 0%    +2.10%  (p=0.000 n=20+20)
/10000-4      10.0k ± 0%     10.2k ± 0%    +1.61%  (p=0.000 n=20+20)
/100000-4      100k ± 0%      102k ± 0%    +1.57%  (p=0.000 n=20+20)

Impl7 is:

  • Up to ~25x faster (1.2ns vs 30.2ns) than channels for init time (0 items)
  • Up to ~2x faster (3.07ms vs 6.76ms) than channels for 100k items
  • Uses ~50% MORE memory (2.43MB vs 1.60MB) than channels for 100k items

Above is not really a fair comparison as standard buffered channels doesn't scale (at all) and they are meant for routine synchronization. Nonetheless, they can and make for an excellent bounded FIFO queue option. Still, impl7 is consistently faster than channels across the board, but uses considerably more memory than channels.


Given its excellent performance under all scenarios, the hybrid approach impl7 seems to be the ideal candidate for a high performance, low memory footprint general purpose FIFO queue.

For above reasons, I propose to port impl7 to the standard library.

All raw benchmark results can be found here.

Internal Slice Size

Impl7 uses linked slices as its underlying data structure.

The size of the internal slice does influence performance and memory consumption significantly.

According to the internal slice size bench tests, larger internal slice sizes yields better performance and lower memory footprint. However, the gains diminishes dramatically as the slice size increases.

Below are a few interesting results from the benchmark tests.

BenchmarkMaxSubsequentSliceSize/1-4                20000         76836 ns/op       53967 B/op       2752 allocs/op
BenchmarkMaxSubsequentSliceSize/2-4                30000         59811 ns/op       40015 B/op       1880 allocs/op
BenchmarkMaxSubsequentSliceSize/4-4                30000         42925 ns/op       33039 B/op       1444 allocs/op
BenchmarkMaxSubsequentSliceSize/8-4                50000         36946 ns/op       29551 B/op       1226 allocs/op
BenchmarkMaxSubsequentSliceSize/16-4               50000         30597 ns/op       27951 B/op       1118 allocs/op
BenchmarkMaxSubsequentSliceSize/32-4               50000         28273 ns/op       27343 B/op       1064 allocs/op
BenchmarkMaxSubsequentSliceSize/64-4               50000         26969 ns/op       26895 B/op       1036 allocs/op
BenchmarkMaxSubsequentSliceSize/128-4              50000         27316 ns/op       26671 B/op       1022 allocs/op
BenchmarkMaxSubsequentSliceSize/256-4              50000         26221 ns/op       28623 B/op       1016 allocs/op
BenchmarkMaxSubsequentSliceSize/512-4              50000         25882 ns/op       28559 B/op       1012 allocs/op
BenchmarkMaxSubsequentSliceSize/1024-4             50000         25674 ns/op       28527 B/op       1010 allocs/op

Given the fact that larger internal slices also means potentially more unused memory in some scenarios, 128 seems to be the perfect balance between performance and worst case scenario for memory footprint.

Full results can be found here.

API

Impl7 implements below API methods.

Operation Method
Add func (q *Queueimpl7) Push(v interface{})
Remove func (q *Queueimpl7) Pop() (interface{}, bool)
Size func (q *Queueimpl7) Len() int
Return First func (q *Queueimpl7) Front() (interface{}, bool)

As nil values are considered valid queue values, similarly to the map data structure, "Front" and "Pop" returns a second bool parameter to indicate whether the returned value is valid and whether the queue is empty or not.

The reason for above method names and signatures are the need to keep compatibility with existing Go data structures such as the list, ring and heap packages.

Below are the method names used by the existing list, ring and heap Go data structures, as well as the new proposed queue.

Operation list ring heap queue
Add PushFront/PushBack Link Push Push
Remove Remove Unlink Pop Pop
Size Len Len - Len
Return First Front - - Front

For comparison purposes, below are the method names for C++, Java and C# for their queue implementation.

Operation C++ Java C#
Add push add/offer Enqueue
Remove pop remove/poll Dequeue
Size size - Count
Return First front peek Peek

Drawbacks

The biggest drawback of the proposed implementation is the potentially extra allocated but not used memory in its head and tail slices.

This scenario realizes when exactly 17 items are added to the queue, causing the creation of a full sized internal slice of 128 positions. Initially only the first element in this new slice is used to store the added value. All the other 127 elements are already allocated, but not used.

// Assuming a 128 internal sized slice.
q := queueimpl7.New()

// Push 16 items to fill the first dynamic slice (sized 16).
for i := 1; i <= 16; i++ {
   q.Push(i)
}
// Push 1 extra item that causes the creation of a new 128 sized slice to store this value.
q.Push(17)

// Pops the first 16 items to release the first slice (sized 16).
for i := 1; i <= 16; i++ {
   q.Pop()
}

// As unsafe.Sizeof (https://golang.org/pkg/unsafe/#Sizeof) doesn't consider the length of slices,
// we need to manually calculate the memory used by the internal slices.
var internalSliceType interface{}
fmt.Println(fmt.Sprintf("%d bytes", unsafe.Sizeof(q)+(unsafe.Sizeof(internalSliceType) /* bytes per slice position */ *127 /* head slice unused positions */)))

// Output for a 64bit system (Intel(R) Core(TM) i5-7267U CPU @ 3.10GHz): 2040 bytes

The worst case scenario realizes when exactly 145 items are added to the queue and 143 items are removed. This causes the queue struct to hold a 128-sized slice as its head slice, but only the last element is actually used. Similarly, the queue struct will hold a separate 128-sized slice as its tail slice, but only the first position in that slice is being used.

// Assuming a 128 internal sized slice.
q := queueimpl7.New()

// Push 16 items to fill the first dynamic slice (sized 16).
for i := 1; i <= 16; i++ {
   q.Push(i)
}

// Push an additional 128 items to fill the first full sized slice (sized 128).
for i := 1; i <= 128; i++ {
   q.Push(i)
}

// Push 1 extra item that causes the creation of a new 128 sized slice to store this value,
// adding a total of 145 items to the queue.
q.Push(1)

// Pops the first 143 items to release the first dynamic slice (sized 16) and
// 127 items from the first full sized slice (sized 128).
for i := 1; i <= 143; i++ {
   q.Pop()
}

// As unsafe.Sizeof (https://golang.org/pkg/unsafe/#Sizeof) doesn't consider the length of slices,
// we need to manually calculate the memory used by the internal slices.
var internalSliceType interface{}
fmt.Println(fmt.Sprintf("%d bytes", unsafe.Sizeof(q)+(unsafe.Sizeof(internalSliceType) /* bytes per slice position */ *(127 /* head slice unused positions */ +127 /* tail slice unused positions */))))

// Output for a 64bit system (Intel(R) Core(TM) i5-7267U CPU @ 3.10GHz): 4072 bytes

Above code was run on Go version "go1.11 darwin/amd64".

Open Questions/Issues

Should this be a deque (double-ended queue) implementation instead? The deque could be used as a stack as well, but it would make more sense to have a queue and stack implementations (like most mainstream languages have) instead of a deque that can be used as a stack (confusing). Stack is a very important computer science data structure as well and so I believe Go should have a specialized implementation for it as well (given the specialized implementation offers real value to the users and not just a "nice" named interface and methods).

Should "Pop" and "Front" return only the value instead of the value and a second bool parameter (which indicates whether the queue is empty or not)? The implication of the change is adding nil values wouldn't be valid anymore so "Pop" and "Front" would return nil when the queue is empty. Panic should be avoided in libraries.

The memory footprint for a 128 sized internal slice causes, in the worst case scenario, a 2040 bytes of memory allocated (on a 64bit system) but not used. Switching to 64 means roughly half the memory would be used with a slight ~2.89% performance drop (252813ns vs 260137ns). The extra memory footprint is not worth the extra performance gain is a very good point to make. Should we change this value to 64 or maybe make it configurable?

Should we also provide a safe for concurrent use implementation? A specialized implementation that would rely on atomic operations to update its internal indices and length could offer a much better performance when comparing to a similar implementation that relies on a mutex.

With the impending release of generics, should we wait to release the new queue package once the new generics framework is released?

Should we implement support for the range keyword for the new queue? It could be done in a generic way so other data structures could also benefit from this feature. For now, IMO, this is a topic for another proposal/discussion.

Summary

I propose to add a new package, "container/queue", to the standard library to support an in-memory, unbounded, general purpose queue implementation.

I feel strongly this proposal should be accepted due to below reasons.

  1. The proposed solution was well researched and probed, being dramatically and consistently faster than 6 other experimental queue implementations as well 3 promising open source queue implementations as well the standard list package and buffered channels; it still consumes considerably less memory than every other queue implementation tested, except for buffered channels
  2. The proposed solution uses a new, unique approach to building queues, yet its implementation is clean and extremely simple. Both main methods, "Push" and "Pop", are composed of only 16 and 19 lines of code (total), respectively. The proposed implementation also have proper tests with 100% test coverage and should require minimal maintenance moving forward
  3. I'll implement any changes the Go community feel are needed for the proposed solution to be worth of the standard library and the Go community

Update 11/26/2018.

Due to many suggestions to make the queue a deque and to deploy it as a proper external package, the deque package was built and deployed here. The proposal now is to add the deque package to the standard library instead of impl7. Refer here for details.

@gopherbot gopherbot added this to the Proposal milestone Sep 29, 2018
@cznic
Copy link
Contributor

cznic commented Sep 29, 2018

Unbound queues don't fit well machines with finite memory.

@christianrpetrin
Copy link
Author

christianrpetrin commented Sep 30, 2018

@cznic I get your point, but I disagree with it for a number of reasons.

  1. Modern devices, even mobile ones, comes with gigabytes of memory available. Efficient data structures that allow engineers to use that much memory efficiently are very welcome
  2. Go already offers a few unbounded data structures such as list, map and others. I don't understand why Go should stop now
  3. In modern cloud computing, it is fairly easy to grow memory usage by distributing the growth to more than one device. The technique I'm using on CloudLogger, for instance, is to allow for infinite memory growth in a single VM, but configuring Kubernetes to auto-scale based on memory. So when a node reaches a certain limit, say 50%, of its used memory, Kubernetes will deploy a new pod and start to round robin the requests (which causes the memory usage to go up) between the old and new pod. This way I effectively (and automatically, with no code changes) can scale one device memory into many others. Techniques like this are bread and butter of modern cloud computing. An amazing language like Go should not restrict the engineers in being able to use such techniques

Maybe the term "unbounded" is a bit confusing. Maybe we should change from "unbounded queue" to "dynamically growing queue". Dynamically growing, which in the queue scenario means the same thing as unbounded, would probably be less confusing. Thoughts?

@networkimprov
Copy link

networkimprov commented Sep 30, 2018

Agreed that the stdlib or golang.org/x repo would benefit from containers like queue and btree. It may become easier to land new containers once generics lands, tho that's a long way off yet.

Dynamic or "elastic" channel buffers have been proposed (and declined) in two forms: a channel buffer pool #20868, and unlimited channel buffer #20352. A workable alternative is one-goroutine-plus-two-channels as in this elastic channel example. It might be useful to include that in your benchmark set.

Container structures are typically unbounded/elastic, so that's unlikely to be a recurring objection here :-)

@christianrpetrin
Copy link
Author

christianrpetrin commented Sep 30, 2018

@networkimprov, thanks for posting the other channel proposals, and agreeing with the queue proposal! :-)

I looked into https://github.com/kylelemons/iq and both queue designs implemented here, a slice, ring based design as well as the dynamic channel.

I have benchmarked two high quality open source, similarly slice ring based implementations: phf and gammazero.

I have also benchmarked channels as queue.

Impl7 is dramatically faster than all ring slice based implementations tested as well as channels.

Having said that, I tried to benchmark both implementations, but the design for both queues is so "unique" I had trouble getting both implementations to fit my benchmark tests.

I couldn't help but to notice the "Infinite Queue implementations in Go. These are bad news, you should always use bounded queues, but they are here for educational purposes." title on the repo.

Blanket statements like this that are based on faulty research really doesn't help. I wish the author would do a better job researching other design and implementation options before making such statements.

@cznic
Copy link
Contributor

cznic commented Sep 30, 2018

Blanket statements like this that are based on faulty research really doesn't help.

The above is a blanket statement because it claims some undefined research is faulty without providing any evidence ;-)

@phf
Copy link
Contributor

phf commented Sep 30, 2018

Thanks for noticing my queue. I obviously agree that a queue should be in the standard library. Personally I believe that a deque is what you want in the standard library. It's usually not much more work to implement (or much more complex code-wise) but it's twice as useful and not (in principle) harder to make fast. The proposal sounds good and I wish it success. Oh, and congrats on beating my queue! :-)

@wsc1
Copy link

wsc1 commented Sep 30, 2018

Nice queue!

However, my feeling is that the state of affairs of containers and the standard library is such that a global assessment of what the standard library should provide for containers before considering inclusion based on the merits of a single implementation of a single data structure.

@networkimprov
Copy link

Re benchmarking kylelemons/iq, its two channels simulate a single standard channel (one is input, the other output). But you already knew that, so the issue must be the goroutine?

Re its claim, "you should always use bounded queues," that means channel queues. As channels are principally an "IGrC" mechanism, if the consumer goroutine stops, the producer/s must block on c <- x (iow feel "backpressure").

I gather the API isn't thread safe? Have you benchmarked a concurrent scenario, e.g. multiple pushers & single popper? That's where a comparison with kylelemons/iq is most relevant. (And that's how I'd use your queue.)

Related: @ianlancetaylor once voiced support for the idea of allocating a bounded channel buffer as-needed, and your list-of-slices technique might apply there. There's not a separate proposal for that as yet.

@egonelbre
Copy link
Contributor

egonelbre commented Oct 1, 2018

I remember https://github.com/karalabe/cookiejar/tree/master/collections having a decent implementation.

Good concurrent queue will need several things i.e. lock-freedom in the happy bath, non-spinning when blocking. For different concurrent situations (MPMC, SPMC, MPSC, SPSC) and high-performance, the code would also look different; since not having to manage concurrency at a particular producer/consumer helps to speed up things.

As for tests you should also test the slowly increasing cases (i.e. 2 x Push, 1 x Pop or 3 x Push, 2 x Pop) and slowly decreasing cases (i.e. fill queue, then 1 x Push, 2 x Pop or 2 x Push, 3 x Pop). Also this document is not including the stable-cases (fill 1e6, or larger depending on internal block size, then repeat for N 1 x Push, 2 x Pop). Also missing large cases with 10e6 items. And average might not be the best measurement for these, look at percentiles also.

@bcmills
Copy link
Contributor

bcmills commented Oct 1, 2018

Should this be a deque (double-ended queue) implementation instead?

I believe that the two-array deque is conventional in language libraries. It tends to provide better cache behavior than a linked-list approach, and a lower cost (particularly in the face of mixed push-front/push-back usage) than a single-array queue.

@bcmills
Copy link
Contributor

bcmills commented Oct 1, 2018

Good concurrent queue will need several things i.e. lock-freedom in the happy bath, non-spinning when blocking.

“Lock-freedom” is not particularly meaningful for programs that run on multiple cores. The CPU's cache coherence protocol has many of the same properties as a lock.

@bcmills
Copy link
Contributor

bcmills commented Oct 1, 2018

As @cznic notes, unbounded queues are not usually what you want in a well-behaved program anyway.

For example, in a Logger implementation, you don't want to consume arbitrary memory and crash (and lose your pending logs) if one of your logging backends can't keep up. If the logs are important, it's better to apply flow-control and delay the program until they can be sent. If the logs are not important, it's better to drop some logs when the buffer gets full, rather than dropping the entire program state.

@bcmills
Copy link
Contributor

bcmills commented Oct 1, 2018

Given the number and variety of queue implementations that already exist, and their variation on properties such as push-front support and concurrency-safety, I would prefer that we not add a queue to the standard library — especially not before we figure out what's going on with generics (#15292).

@bcmills bcmills added the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label Oct 1, 2018
@uluyol
Copy link
Contributor

uluyol commented Oct 1, 2018

I agree that when used for concurrency, unbounded queues are a bad fit. However, unbounded queues are useful for sequential work e.g. when doing a breadth first search. We use slices that can grow arbitrarily, maps that can grow arbitrarily, and stacks that can grow arbitrarily, why can't the programmer be trusted to use them when appropriate?

I don't understand what the push back is on the idea (though I agree that it should wait on generics). An unbounded queue is not an unbounded channel.

@christianrpetrin
Copy link
Author

Thanks @egonelbre for pointing out to this https://github.com/karalabe/cookiejar/blob/master/collections/queue/queue.go
queue implementation. This is an interesting design based on a dynamically growing circular slice of blocks.

I have added this queue to the bench tests. Feel free to clone the repo and run the tests to validate below results by yourself.

benchstat rawresults/bench-cookiejar.txt rawresults/bench-impl7.txt
name       old time/op    new time/op    delta
/0-4         9.47µs ± 0%    0.00µs ± 3%   -99.99%  (p=0.000 n=19+20)
/1-4         9.52µs ± 1%    0.07µs ± 1%   -99.28%  (p=0.000 n=18+20)
/10-4        9.76µs ± 1%    0.58µs ± 0%   -94.08%  (p=0.000 n=19+20)
/100-4       11.6µs ± 1%     3.1µs ± 0%   -73.54%  (p=0.000 n=20+18)
/1000-4      29.6µs ± 2%    25.8µs ± 1%   -12.71%  (p=0.000 n=20+20)
/10000-4      231µs ± 1%     260µs ± 1%   +12.56%  (p=0.000 n=19+18)
/100000-4    3.21ms ±13%    3.07ms ± 3%    -4.41%  (p=0.000 n=18+20)

name       old alloc/op   new alloc/op   delta
/0-4         65.6kB ± 0%     0.0kB       -100.00%  (p=0.000 n=20+20)
/1-4         65.6kB ± 0%     0.0kB ± 0%   -99.93%  (p=0.000 n=20+20)
/10-4        65.6kB ± 0%     0.6kB ± 0%   -99.09%  (p=0.000 n=20+20)
/100-4       66.4kB ± 0%     3.4kB ± 0%   -94.88%  (p=0.000 n=20+20)
/1000-4      73.6kB ± 0%    25.2kB ± 0%   -65.80%  (p=0.000 n=20+20)
/10000-4      277kB ± 0%     243kB ± 0%   -12.29%  (p=0.000 n=20+20)
/100000-4    2.45MB ± 0%    2.43MB ± 0%    -0.79%  (p=0.000 n=20+20)

name       old allocs/op  new allocs/op  delta
/0-4           2.00 ± 0%      0.00       -100.00%  (p=0.000 n=20+20)
/1-4           2.00 ± 0%      2.00 ± 0%      ~     (all equal)
/10-4          11.0 ± 0%      15.0 ± 0%   +36.36%  (p=0.000 n=20+20)
/100-4          101 ± 0%       107 ± 0%    +5.94%  (p=0.000 n=20+20)
/1000-4       1.00k ± 0%     1.02k ± 0%    +2.00%  (p=0.000 n=20+20)
/10000-4      10.0k ± 0%     10.2k ± 0%    +1.56%  (p=0.000 n=20+20)
/100000-4      100k ± 0%      102k ± 0%    +1.52%  (p=0.000 n=20+20)

This queue does have good performance for fairly large data sets, but its performance for small data sets, performance and memory wise, are not very good.

This is a very good implementation, I really like it, but:

  1. Due to its poor performance with small data sets, I would not regard this as a well balanced, general purpose queue
  2. I have concerns regarding memory leaks on this queue. I don't see anywhere in the code where the
    just released position in the internal slice gets set to nil
  3. This queue suffers from the same problem as every other pure slice based queues suffer: copying of existing elements when the queue needs to grow

@christianrpetrin
Copy link
Author

christianrpetrin commented Oct 1, 2018

@networkimprov the kylelemons/iq is a very unique design which seems to me was built to prove a point the channels does not make good unbounded queues.

I can't agree more with this. Channels are meant for routine synchronization, not as a sequential data store like a traditional queue can, which by the way, is a core use case for queues (if you can't process fast enough, store). That's why I was actually adamant in including buffered channels in the tests: they are not meant to be used as queues. If people really want to use them as queues, then a separate proposal with a separate discussion should follow. This is not a "dynamic channel" proposal. Please keep that in mind.

Regarding the claim, "you should always use bounded queues,", I understood (by reading the code) that he/she meant channels. My point was that his/her statement is not clear and bounded enough, so people reading might think no unbounded queues, regardless of its design, implementation, performance, etc are good. I can't agree with that.

Impl7 is not safe for concurrent use. As the map and all stdlib containers are not safe for concurrent use, the new queue should follow on the pattern. Not all users will need a safe for concurrent use queue. So IMO, a safe for concurrent use version is a topic for another proposal/discussion just like we have a separate sync.Map for a safe for concurrent use map implementation.

I do talk a bit about the subject on the design document though.

@christianrpetrin
Copy link
Author

christianrpetrin commented Oct 1, 2018

As @cznic notes, unbounded queues are not usually what you want in a well-behaved program anyway.

For example, in a Logger implementation, you don't want to consume arbitrary memory and crash (and lose your pending logs) if one of your logging backends can't keep up. If the logs are important, it's better to apply flow-control and delay the program until they can be sent. If the logs are not important, it's better to drop some logs when the buffer gets full, rather than dropping the entire program state.

That's sort of the plan. Take a look at this response. I can't apply flow control all the way to the clients. This would make the hosting system slower, affecting its clients (the human beings). Non-critical infrastructure such as logging should not affect the hosting system significantly. This is a problem that has no solution other than to either queue or discard the logs. Discard should be the very last approach, only used when queueing is no longer possible. FWIW, CloudLogger already treats important logs in a very different way than regular logs, so even if one of the backend servers dies and a whole bunch of logs get lost, that's by design. All the important logs are not long term cached and will not be lost (in large numbers, at least).

Also I get your concerns around using too much memory, but think of this queue as a data structure that is highly efficient also in dealing with large amounts of data, gigabytes of data. It has to be unbounded because each device has a certain amount of available memory to be used at any given time. It should be up to the engineer using the queue to decide how much memory he/she wants to use. It's the job of the language designers to provide tools that allow the engineers to handle their needs with the maximum possible efficient, and not dictate how much memory their programs should use.

If we do have and can provide to all Go engineers a data structure that is flexible enough they know they will get really good performance no matter their needs, I think that data structure brings real value. Go engineers have been building their queues for a long time now. Often with (pretty bad results)[https://github.com/christianrpetrin/queue-tests/blob/master/bench_queue.md]. That's also my point here. Go engineers are already doing "the bad" thing of using a lot of memory. The proposed solution just allows them to keep doing that in a much more efficient way.

Again, I'm not proposing a super crazy, specialized data structure here. Coming from Java and C#, I was super surprised Go doesn't offer a traditional queue implementation. Having a traditional queue implementation, especially a super flexible and fast one, also helps bring all those Java, C#, etc devs to the Go world.

@christianrpetrin
Copy link
Author

christianrpetrin commented Oct 1, 2018

Given the number and variety of queue implementations that already exist, and their variation on properties such as push-front support and concurrency-safety, I would prefer that we not add a queue to the standard library — especially not before we figure out what's going on with generics (#15292).

@bcmills please consider below.

  1. The queue proposed here is dramatically faster than every other queue tested. If you known of another queue implementation you feel like might be a better design/implementation than the proposed here, please let me know and I'm extremely happy to validate it. If we find a faster, balanced queue implementation, I'm more than happy to root for that implementation to make into the stdlib. As a matter of fact, I'd definitely use it in all my projects
  2. @networkimprov mentioned generics are a long way off. Should we wait and hold off improvements until we finally decide what we want to do with generics? How long have we been talking about generics? How much longer will it take? IMO we should keep on with an agile/SCRUM like approach to Go development where small, incremental improvements are constantly built and released. That's one of the reasons why Go is beating every other mainstream languages out there. We should allow and foster organic growth and constant small improvements to the language. High level discussions do have their place and should exist, but they can take place unilaterally and should not hold off development for very long.
  3. I'm very happy to implement changes the proposed the queue to make it better. All suggestions and criticism are most welcome. Especially the people who is thumbing down my proposal, I'm not seeing many comments why the proposal is not good. Please let us know why you disagree with the proposal and reply back to my comments if I do so, so we can have a constructive conversation.

@cznic
Copy link
Contributor

cznic commented Oct 1, 2018

Slices can be used as queues, but often they're not. Sometimes a program collects items and sort them, for example. A bounded slice is not helpful in this case. If the sorting requires more memory than available, there's no other option, in the first approximation, than to crash.

Maps serve a different purpose, but the argument about bound/unbound is the same.

There are many other general data structures with different big-O characteristics of their particular operations. One usually carefully selects the proper one and the argument about bound/unbound is still the same.

Channels typically serve as a queue between concurrently executing producers and consumers. They have a limit that blocks producers when reached and provides useful back pressure. Making a channel unlimited loses that property.

A non-channel queue (FIFO) offers a rather limited set of operations: Put and Get (or whatever the names were chosen). If it is to be used in the scenario where a slice, map or other generic data structure can be used, then there's no real need to have or use it.

The remaining scenarios for using a FIFO are IMO mostly - if not only - where the producers and consumers execute concurrently. If a FIFO is used effectively as a channel there are two important questions to answer:

  • Why not use a channel in the first place?
  • Why the back pressure property is useful for channels but nut useful for a FIFO used as a channel?

Also note that slices and maps that exist today in the language are type generic. The proposed unbound queue seems to use interface{} typed items. That adds memory overhead and hurts cache locality. I haven't really studied the code, but in a proper benchmark, []interface{} cannot beat []int in any of those two aspects. Similarly for []interface{} vs a buffered chan int.

@bcmills
Copy link
Contributor

bcmills commented Oct 1, 2018

If you known of another queue implementation you feel like might be a better design/implementation than the proposed here, please let me know and I'm extremely happy to validate it.

See https://golang.org/cl/94137 (which I ought to dust off and submit one of these days).

@christianrpetrin
Copy link
Author

christianrpetrin commented Oct 1, 2018

Also to help validate my point of the whole bunch of problematic Go queues out there, I was able to validate the cookiejar queue actually has a memory leak. Just ran some bench tests locally with large items added.

go test -benchmem -count 1 -benchtime 60s -timeout 60m -bench=".*Cookiejar" -run=^$
goos: darwin
goarch: amd64
pkg: github.com/christianrpetrin/queue-tests
BenchmarkCookiejar/1000000-4         	    2000	  44099280 ns/op	24830036 B/op	 1000489 allocs/op
BenchmarkCookiejar/10000000-4        	     200	 464993059 ns/op	317408985 B/op	10004883 allocs/op
BenchmarkCookiejar/100000000-4       	      10	7453667009 ns/op	9649322720 B/op	100048829 allocs/op
^Csignal: interrupt
FAIL	github.com/christianrpetrin/queue-tests	1510.242s

The memory growth is not linear to the growth of work. Also the last 1bi test took so long (5+ minutes), I had to abort the test. Impl7 doesn't suffer from these problems. I would most certainly not recommend cookiejar queue implementation until these issues are fixed.

@ianlancetaylor
Copy link
Member

This is a worthy proposal, but, since generics are an active topic of discussion (https://go.googlesource.com/proposal/+/master/design/go2draft.md), I agree with @bcmills that we should not pursue this further until that discussion has settled. It will inevitably affect the API, and possibly the implementation, of any new container types in the standard library.

@ianlancetaylor
Copy link
Member

Let me add that a queue implementation is just as good outside the standard library as in. There is no rush to adding queues to the standard library; people can use them effectively as an external package.

@networkimprov
Copy link

Specially the people who is thumbing down my proposal, I'm not seeing many comments why the proposal is not good.

Go philosophy takes a "when in doubt, leave it out" stance, so some ppl thumb-down stuff they hardly/never use. For example, notice the thumbs on this proposal to drop complex number support: #19921. From what I've read, the maintainers don't make decisions based on thumb-votes.

@christianrpetrin
Copy link
Author

christianrpetrin commented Oct 1, 2018

@ianlancetaylor agree the generics discussion should move forward. My only concern is how much longer will it take until a final decision is made and deployed? There's a clear need of a general purpose queue in the Go stdlib. Otherwise we wouldn't see so many open source queue implementations out there. Can we get some sort of timeline on when the decision on generics will be made?

Also agree a queue could be deployed to any open source repo, but please read the background piece of my proposal to help you understand why I think there's a serious need for a high perf, issue free queue in the Go stdlib. To highlight the point:

Using external packages has to be done with great care. Packages on the stdlib, however, are supposed to be really well tested and performatic, as they are validated by the whole Go community. A decision to use a stdlib package is an easy one, but to use an open source, not well disseminated implementation is much more complicated decision to make. The other big problem is how does people find out what are the best, safe to use queue implementations? As pointed by @egonelbre, this one looks like a really interesting queue, but it has hidden memory leaks. Most Go engineers doesn't have the time to validate and properly probe these queues. That's the biggest value we can provide to the whole community: a safe, performant and very easy queue to be used.

@networkimprov
Copy link

@ianlancetaylor how about landing a queue in golang.org/x in the interim?

@ianlancetaylor
Copy link
Member

There is no specific timeline for generics support.

The argument about what should be in the standard library is a large one that should not take place on this issue. I'll just mention the FAQ: https://golang.org/doc/faq#x_in_std .

@Merovius
Copy link
Contributor

Merovius commented Oct 9, 2018

For a more real world like scenario

I disagree with the characterization of interface{} as "more real world like". I don't think there is anything more realistic about pointer-shaped values over non-pointer-shaped ones - on the contrary, actually. I can't think of a good reason why things in a queue should be pointers. ISTM they will, in general, represent plain data.

much more flexible

The point is specifically (and it would be helpful if you'd acknowledge that instead of dismissing it as an "edge case") that this flexibility incurs a cost, in the form of more allocations and indirections. The generic queue implementation is slower. And FTR, I predict that generics won't change that either (though we'll have to see that once they are implemented).

@rsc
Copy link
Contributor

rsc commented Oct 10, 2018

At the least, this proposal should be marked as "on hold for generics". But really I think this should be done outside the standard library and prove its necessity first. We have not been creating new packages from scratch in the standard library for quite some time.

@rsc rsc changed the title Proposal: Built in support for high performance unbounded queue proposal: add container/queue Oct 10, 2018
@christianrpetrin
Copy link
Author

@rsc I agree we should wait for the generics discussion to take place. In the meantime, I'll try to validate the proposed solution as much as possible by pinging golang-nuts, making a proper external package, etc. The biggest problem about exposing the queue as an external package only is how to bring awareness about it. Marketing is a gigantic problem that I can't solve by myself. It will probably take years before the package really takes off and people start using it and report their experiences.

In any case, I'll keep an eye and try to join the generics discussion as well.

Thanks for taking a look at the proposal!

@christianrpetrin
Copy link
Author

christianrpetrin commented Nov 26, 2018

Given the many suggestions, I have built a new queue and deployed it as a proper package. The new queue is based on impl7 and is actually a deque now. Appreciate the suggestions (@phf, @bcmills) as a deque is indeed a much more flexible data structure that can be used not only as a FIFO queue, but also as LIFO stack and as a deque with mixed push/pop.

The new deque is a considerably more advanced data structure than impl7 due to the fact is now a deque and also because it has a new design.

The new design leverages the linked slices design by making it an auto shrinking ring. The new design improved impl7 performance and efficiency quite dramatically in most tests, although it is now a bit slower on the fill tests. Refer here for the design details.

ns/op

I have also implemented many new new unit, integration, API and especially, benchmark tests. @egonelbre, much appreciated for the test suggestions and to benchmark cookiejar's queue.

Given the need for real world usage data, I need to somehow get people to start using it. I'll do my best to raise awareness about the deque, but this is an area I really need help. If you are using a queue/stack/deque, please consider switching to the new deque. Also if you know someone who is using a queue/stack/deque, please let them know about the deque.

Below is a few select tests of deque vs impl7.

deque vs impl7 - FIFO queue - fill tests

benchstat testdata/BenchmarkFillDequeQueue.txt testdata/BenchmarkFillImpl7Queue.txt
name        old time/op    new time/op    delta
/0-4          39.9ns ± 9%    35.9ns ± 3%   -9.88%  (p=0.000 n=10+9)
/1-4           142ns ± 1%     133ns ± 1%   -6.61%  (p=0.000 n=10+10)
/10-4          636ns ± 1%     764ns ± 7%  +20.26%  (p=0.000 n=9+9)
/100-4        4.74µs ± 3%    4.28µs ± 3%   -9.67%  (p=0.000 n=9+9)
/1000-4       43.0µs ±23%    38.8µs ± 7%   -9.84%  (p=0.004 n=10+10)
/10000-4       450µs ±19%     388µs ± 5%  -13.70%  (p=0.001 n=10+10)
/100000-4     4.24ms ± 4%    3.95ms ± 2%   -6.84%  (p=0.000 n=10+8)
/1000000-4    46.8ms ± 1%    45.9ms ± 4%   -1.87%  (p=0.008 n=8+9)

name        old alloc/op   new alloc/op   delta
/0-4           64.0B ± 0%     48.0B ± 0%  -25.00%  (p=0.000 n=10+10)
/1-4            144B ± 0%      112B ± 0%  -22.22%  (p=0.000 n=10+10)
/10-4           608B ± 0%      736B ± 0%  +21.05%  (p=0.000 n=10+10)
/100-4        6.19kB ± 0%    4.26kB ± 0%  -31.27%  (p=0.000 n=10+10)
/1000-4       33.0kB ± 0%    33.2kB ± 0%   +0.58%  (p=0.000 n=10+10)
/10000-4       322kB ± 0%     323kB ± 0%   +0.23%  (p=0.000 n=10+10)
/100000-4     3.22MB ± 0%    3.23MB ± 0%   +0.20%  (p=0.000 n=10+10)
/1000000-4    32.2MB ± 0%    32.3MB ± 0%   +0.19%  (p=0.000 n=10+9)

name        old allocs/op  new allocs/op  delta
/0-4            1.00 ± 0%      1.00 ± 0%     ~     (all equal)
/1-4            4.00 ± 0%      4.00 ± 0%     ~     (all equal)
/10-4           15.0 ± 0%      17.0 ± 0%  +13.33%  (p=0.000 n=10+10)
/100-4           107 ± 0%       109 ± 0%   +1.87%  (p=0.000 n=10+10)
/1000-4        1.01k ± 0%     1.02k ± 0%   +0.99%  (p=0.000 n=10+10)
/10000-4       10.1k ± 0%     10.2k ± 0%   +0.79%  (p=0.000 n=10+10)
/100000-4       101k ± 0%      102k ± 0%   +0.78%  (p=0.000 n=10+10)
/1000000-4     1.01M ± 0%     1.02M ± 0%   +0.78%  (p=0.000 n=10+10)

deque vs impl7 - FIFO queue - refill tests

benchstat testdata/BenchmarkRefillDequeQueue.txt testdata/BenchmarkRefillImpl7Queue.txt
name       old time/op    new time/op    delta
/1-4         3.79µs ± 1%   10.03µs ± 6%  +165.06%  (p=0.000 n=9+10)
/10-4        37.8µs ± 7%    74.7µs ± 5%   +97.58%  (p=0.000 n=10+10)
/100-4        361µs ± 7%     442µs ± 4%   +22.25%  (p=0.000 n=9+10)
/1000-4      3.75ms ± 4%    3.97ms ± 3%    +6.10%  (p=0.000 n=10+9)
/10000-4     36.5ms ± 3%    39.1ms ± 6%    +7.17%  (p=0.000 n=10+10)
/100000-4     380ms ± 1%     421ms ±24%   +10.67%  (p=0.000 n=10+9)

name       old alloc/op   new alloc/op   delta
/1-4         1.60kB ± 0%    6.40kB ± 0%  +300.00%  (p=0.000 n=10+10)
/10-4        16.0kB ± 0%    68.8kB ± 0%  +330.00%  (p=0.000 n=10+10)
/100-4        160kB ± 0%     421kB ± 0%  +163.00%  (p=0.000 n=8+10)
/1000-4      2.42MB ± 0%    3.32MB ± 0%   +37.30%  (p=0.000 n=10+10)
/10000-4     17.0MB ± 0%    32.3MB ± 0%   +89.36%  (p=0.000 n=8+10)
/100000-4     162MB ± 0%     323MB ± 0%   +99.52%  (p=0.000 n=8+8)

name       old allocs/op  new allocs/op  delta
/1-4            100 ± 0%       300 ± 0%  +200.00%  (p=0.000 n=10+10)
/10-4         1.00k ± 0%     1.60k ± 0%   +60.00%  (p=0.000 n=10+10)
/100-4        10.0k ± 0%     10.8k ± 0%    +8.00%  (p=0.000 n=10+10)
/1000-4        100k ± 0%      102k ± 0%    +1.80%  (p=0.000 n=10+10)
/10000-4      1.00M ± 0%     1.02M ± 0%    +1.57%  (p=0.000 n=8+10)
/100000-4     10.0M ± 0%     10.2M ± 0%    +1.56%  (p=0.000 n=10+10)

Although impl7 marginally outperforms deque on the fill test, deque outperforms impl7
on all other tests by large margins in most cases, especially when it comes to memory usage.
The refill and (below) stable are just two examples.

deque vs impl7 - FIFO queue - stable tests

benchstat testdata/BenchmarkStableDequeQueue.txt testdata/BenchmarkStableImpl7Queue.txt
name        old time/op    new time/op    delta
/1-4          34.8ns ± 1%    38.1ns ± 5%    +9.48%  (p=0.000 n=10+9)
/10-4          353ns ± 2%     391ns ± 7%   +10.92%  (p=0.000 n=10+10)
/100-4        3.45µs ± 1%    4.13µs ±19%   +19.82%  (p=0.000 n=10+10)
/1000-4       34.5µs ± 1%    39.6µs ±12%   +14.95%  (p=0.000 n=10+10)
/10000-4       346µs ± 2%     385µs ± 4%   +11.40%  (p=0.000 n=10+10)
/100000-4     3.43ms ± 1%    3.85ms ±11%   +12.11%  (p=0.000 n=10+10)
/1000000-4    34.4ms ± 1%    38.3ms ± 6%   +11.13%  (p=0.000 n=10+10)

name        old alloc/op   new alloc/op   delta
/1-4           16.0B ± 0%     32.0B ± 0%  +100.00%  (p=0.000 n=10+10)
/10-4           160B ± 0%      322B ± 0%  +101.25%  (p=0.000 n=10+10)
/100-4        1.60kB ± 0%    3.23kB ± 0%  +101.56%  (p=0.000 n=10+10)
/1000-4       16.0kB ± 0%    32.2kB ± 0%  +101.56%  (p=0.000 n=10+10)
/10000-4       160kB ± 0%     322kB ± 0%  +101.56%  (p=0.000 n=10+10)
/100000-4     1.60MB ± 0%    3.23MB ± 0%  +101.56%  (p=0.000 n=10+10)
/1000000-4    16.0MB ± 0%    32.3MB ± 0%  +101.56%  (p=0.000 n=10+10)

name        old allocs/op  new allocs/op  delta
/1-4            1.00 ± 0%      1.00 ± 0%      ~     (all equal)
/10-4           10.0 ± 0%      10.0 ± 0%      ~     (all equal)
/100-4           100 ± 0%       101 ± 0%    +1.00%  (p=0.000 n=10+10)
/1000-4        1.00k ± 0%     1.01k ± 0%    +1.50%  (p=0.000 n=10+10)
/10000-4       10.0k ± 0%     10.2k ± 0%    +1.56%  (p=0.000 n=10+10)
/100000-4       100k ± 0%      102k ± 0%    +1.56%  (p=0.000 n=10+10)
/1000000-4     1.00M ± 0%     1.02M ± 0%    +1.56%  (p=0.000 n=10+10)

Refer here for all benchmark tests.

The new package is production ready and I have released its first version. It can be found here.

Regarding the suggestions for more tests (here, here) and to gather real world data, @bcmills, @egonelbre, does the new benchmark tests help address your concerns?

Regarding the questions whether a FIFO queue is actually useful (here,here), @egonelbre, @rsc, does a deque help address your concerns? Please consider a deque not only can be used as a FIFO queue, but also as a LIFO stack as well as with mixed push/pop.

Regarding the concerns whether Go needs a specialized queue or just using a simple slice as a queue is good enough (here), @Merovius, does the new queue performance vs slice helps address your concerns? In the wake of all the new benchmark tests and the much improved deque design, do you still feel like a deque naive implementation that uses a slice such as CustomSliceQueue is the best option?

@tv42
Copy link

tv42 commented Jun 9, 2019

Marketing is a gigantic problem that I can't solve by myself.

I'm an outsider to this conversation, but please take this point: the Go stdlib (or golang.org/x) should not be a marketing mechanism for your code; especially not for its clever implementation. That line of thinking can only decrease the stdlib's technical quality.

@gopherbot gopherbot removed the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label Aug 16, 2019
@christianrpetrin
Copy link
Author

Marketing is a gigantic problem that I can't solve by myself.

I'm an outsider to this conversation, but please take this point: the Go stdlib (or golang.org/x) should not be a marketing mechanism for your code; especially not for its clever implementation. That line of thinking can only decrease the stdlib's technical quality.

@tv42 , thanks for joining the discussion!

I realized (and mentioned) the marketing problem only after it was proposed to build an external queue package and get real, production systems to use it to validate its performance. My point was I need help to raise awareness about the Deque package, so people would consider using it. It was never my intention in the first place to even publish an external package. The goal was to propose a very efficient queue implementation to be added to the standard library. For that, there should be no need to market the dequeue in any shape or form other than propose it here.

@phemmer
Copy link

phemmer commented Apr 18, 2020

Wanted to throw in a use case here as I've been searching for a queue-like solution and none of the existing implementations I can find address it (which has me suspicious I'm thinking about the problem the wrong way, or focused on the wrong solution, but ignoring that for now).
The difficulty I'm encountering is the need for a queue where I can peek at the entire queue.

Might make sense in an example: Lets say the app is sending messages to a remote system asynchronously. So it sends a number of messages individually (batching handled transparently), and gets an acknowledgement some time in the future of some, or all, of the messages. The app needs to handle things like losing its connection, so it has to hang on to the messages until it receives that acknowledgement. If it does lose the connection, once reestablished, it has to resend them.

At first this seems like a pretty straightforward unbounded circular queue. Something that is usually small, but can grow when needed, and can handle frequent push/pop without reallocation. However that difficulty I mentioned is when the connection is reestablished. The app needs to resend the messages, so it has to iterate the queue. But it can't remove messages from the queue until they've been acknowledged. All the queue implementations I've seen (including the one proposed) only allow you to peek at the first item. Meaning that we'd have to send one message and wait for that ack before before proceeding to the next. This can be extremely slow, and without batching, it may never catch back up to real-time.

@Merovius
Copy link
Contributor

@phemmer I think you could also remove an element and add it again (to the other end). But also, wouldn't something like var unackedMsgs map[msgID]message make far more sense? After all, that allows you to drop the requirement that messages are ack'ed in the same order they are sent. Lastly, if you actually want an ordered queue, then why not use a slice? It allows you to iterate over it and it can grow dynamically.

@phemmer
Copy link

phemmer commented Apr 18, 2020

Removing and adding to the end destroys ordering. The connection could die half way through the re-sending, and the message source is still adding its own messages to the queue as well.
Messages are always acked by the remote end in the order that they were sent. It'll never ack message 10 without message message 9 having been acked.

I've considered using a map like you suggest, but it would mean that when re-sending, I'd have to grab the keys and sort them before re-sending. And as the message source is still adding messages, this may have to be done several times, skipping the ones that are now in flight. Where as with a queue I could just iterate until I reach the end. But it's not a terrible idea.

As for why not using a slice, because of the high amount of reallocation, or manual work to prevent re-allocation. As messages are added, and shifted off the front, the slice would either have to be re-allocated, or the contents moved over.

But if we don't think the ability to peek at the whole queue is something the proposed solution should address, then this can be dropped. Don't want to hijack this issue for solving my problem. Just wanted to throw the use case out for consideration.

@sfllaw
Copy link
Contributor

sfllaw commented Oct 14, 2021

@christianrpetrin If we are going to add container/deque, then it should probably support generics from the very beginning. Perhaps we should also wait for more clarity on how #48287 will affect API design?

P.S. I think your work is great and I’m not asking you to rewrite it. I’m only pointing out that it would be a real shame to implement a pre-generics API for a stdlib container.

@alphadose
Copy link

my implementation https://github.com/alphadose/ZenQ

also this -> #52652

@egonelbre
Copy link
Contributor

@alphadose this is about implementing an unbounded queue.

@changkun
Copy link
Member

changkun commented May 3, 2022

Generics are out. Should we remove the "Hold"?

This issue discusses too many implementation details. If I understand the proposal correctly, with the current generics design, we are likely to propose a set of APIs as follows?

package queue

// Queue implements a FIFO queue.
// The zero value of Queue is an empty queue ready to use.
type Queue[T any] struct

func (q *Queue[T]) Push(v T)
func (q *Queue[T]) Pop() T
func (q *Queue[T]) Len() int
func (q *Queue[T]) Front() T
func (q *Queue[T]) Tail() T
func (q *Queue[T]) Clear()

@alphadose
Copy link

alphadose commented May 3, 2022

+1 to the above
And also we can create a golang STL much like C++ STL

PS:- it would be even better if we can make the STL thread-safe by default which is not the case in C++

@ianlancetaylor
Copy link
Member

Per #27935 (comment) this should be done outside the standard library first. Has that happened?

@changkun
Copy link
Member

changkun commented May 3, 2022

It should have happened since day 1 because the proposal also included an implementation that lives here: https://github.com/christianrpetrin/queue-tests/blob/master/queueimpl7/queueimpl7.go, although it is not a generic version.

@alphadose
Copy link

A good candidate https://github.com/liyue201/gostl
covers most DS/algos but not thread safe and doesn't use generics
With a bit of modification, we can have a thread safe generic golang STL

@lovromazgon
Copy link

I added a generic implementation of Queueimpl7 (fork), here is a PR containing the changes I did, the benchmark results suggest that the generic version is even faster.

Benchmark results
go test -bench=. benchmark_test.go
goos: darwin
goarch: amd64
cpu: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
BenchmarkList/0-16          	36096993	        33.10 ns/op
BenchmarkList/1-16          	15740581	        72.43 ns/op
BenchmarkList/10-16         	 2861898	       425.1 ns/op
BenchmarkList/100-16        	  263336	      4220 ns/op
BenchmarkList/1000-16       	   25462	     47118 ns/op
BenchmarkList/10000-16      	    2110	    534454 ns/op
BenchmarkList/100000-16     	     127	   9137029 ns/op
BenchmarkChannel/0-16       	35017080	        31.85 ns/op
BenchmarkChannel/1-16       	15119679	        76.04 ns/op
BenchmarkChannel/10-16      	 2512936	       470.3 ns/op
BenchmarkChannel/100-16     	  249133	      4416 ns/op
BenchmarkChannel/1000-16    	   23700	     49719 ns/op
BenchmarkChannel/10000-16   	    2091	    527633 ns/op
BenchmarkChannel/100000-16  	     219	   5352969 ns/op
BenchmarkGammazero/0-16     	1000000000	         1.158 ns/op
BenchmarkGammazero/1-16     	23397309	        44.38 ns/op
BenchmarkGammazero/10-16    	 9304092	       127.2 ns/op
BenchmarkGammazero/100-16   	  668485	      1562 ns/op
BenchmarkGammazero/1000-16  	   57878	     20585 ns/op
BenchmarkGammazero/10000-16 	    4836	    221882 ns/op
BenchmarkGammazero/100000-16         	     477	   2449230 ns/op
BenchmarkPhf/0-16                    	49410674	        23.23 ns/op
BenchmarkPhf/1-16                    	35873148	        31.50 ns/op
BenchmarkPhf/10-16                   	 2566104	       458.0 ns/op
BenchmarkPhf/100-16                  	  405482	      2672 ns/op
BenchmarkPhf/1000-16                 	   45190	     26028 ns/op
BenchmarkPhf/10000-16                	    4074	    266890 ns/op
BenchmarkPhf/100000-16               	     237	   5075018 ns/op
BenchmarkJuju/0-16                   	 3639322	       324.1 ns/op
BenchmarkJuju/1-16                   	 3553687	       337.7 ns/op
BenchmarkJuju/10-16                  	 2811319	       427.6 ns/op
BenchmarkJuju/100-16                 	  624660	      1837 ns/op
BenchmarkJuju/1000-16                	   56377	     21430 ns/op
BenchmarkJuju/10000-16               	    4329	    241211 ns/op
BenchmarkJuju/100000-16              	     369	   3212147 ns/op
BenchmarkImpl1/0-16                  	286230117	         4.186 ns/op
BenchmarkImpl1/1-16                  	28610796	        36.27 ns/op
BenchmarkImpl1/10-16                 	 3347899	       362.3 ns/op
BenchmarkImpl1/100-16                	  680595	      1635 ns/op
BenchmarkImpl1/1000-16               	   52677	     21847 ns/op
BenchmarkImpl1/10000-16              	    4353	    246534 ns/op
BenchmarkImpl1/100000-16             	     136	   8901747 ns/op
BenchmarkImpl2/0-16                  	278119263	         4.377 ns/op
BenchmarkImpl2/1-16                  	28694433	        36.99 ns/op
BenchmarkImpl2/10-16                 	 3432369	       355.2 ns/op
BenchmarkImpl2/100-16                	  650058	      1645 ns/op
BenchmarkImpl2/1000-16               	   54530	     21765 ns/op
BenchmarkImpl2/10000-16              	    3944	    286462 ns/op
BenchmarkImpl2/100000-16             	     136	   8631269 ns/op
BenchmarkImpl3/0-16                  	 2795673	       422.7 ns/op
BenchmarkImpl3/1-16                  	 2726344	       431.1 ns/op
BenchmarkImpl3/10-16                 	 2419592	       488.0 ns/op
BenchmarkImpl3/100-16                	  999382	      1041 ns/op
BenchmarkImpl3/1000-16               	   70945	     16683 ns/op
BenchmarkImpl3/10000-16              	    6306	    192223 ns/op
BenchmarkImpl3/100000-16             	     532	   2215646 ns/op
BenchmarkImpl4/0-16                  	 3473112	       342.2 ns/op
BenchmarkImpl4/1-16                  	 3378958	       353.7 ns/op
BenchmarkImpl4/10-16                 	 2882436	       416.4 ns/op
BenchmarkImpl4/100-16                	 1219239	       980.0 ns/op
BenchmarkImpl4/1000-16               	   74776	     16138 ns/op
BenchmarkImpl4/10000-16              	    5860	    193688 ns/op
BenchmarkImpl4/100000-16             	     452	   2592877 ns/op
BenchmarkImpl5/0-16                  	 3095726	       383.5 ns/op
BenchmarkImpl5/1-16                  	 2985129	       405.7 ns/op
BenchmarkImpl5/10-16                 	 2607858	       460.0 ns/op
BenchmarkImpl5/100-16                	  999908	      1015 ns/op
BenchmarkImpl5/1000-16               	   69468	     16653 ns/op
BenchmarkImpl5/10000-16              	    5889	    197066 ns/op
BenchmarkImpl5/100000-16             	     415	   2834379 ns/op
BenchmarkImpl6/0-16                  	717184951	         1.692 ns/op
BenchmarkImpl6/1-16                  	17817142	        63.48 ns/op
BenchmarkImpl6/10-16                 	 3145177	       386.9 ns/op
BenchmarkImpl6/100-16                	  650286	      1635 ns/op
BenchmarkImpl6/1000-16               	   62350	     19002 ns/op
BenchmarkImpl6/10000-16              	    5012	    213779 ns/op
BenchmarkImpl6/100000-16             	     499	   2388561 ns/op
BenchmarkImpl7/0-16                  	1000000000	         1.005 ns/op
BenchmarkImpl7/1-16                  	19193989	        59.29 ns/op
BenchmarkImpl7/10-16                 	 3058149	       400.5 ns/op
BenchmarkImpl7/100-16                	  724555	      1444 ns/op
BenchmarkImpl7/1000-16               	   68046	     17667 ns/op
BenchmarkImpl7/10000-16              	    5490	    199761 ns/op
BenchmarkImpl7/100000-16             	     526	   2275840 ns/op
BenchmarkImpl7Generic/0-16           	1000000000	         1.015 ns/op
BenchmarkImpl7Generic/1-16           	21470794	        49.56 ns/op
BenchmarkImpl7Generic/10-16          	 4868738	       242.6 ns/op
BenchmarkImpl7Generic/100-16         	  929288	      1138 ns/op
BenchmarkImpl7Generic/1000-16        	   72534	     16419 ns/op
BenchmarkImpl7Generic/10000-16       	    6132	    188530 ns/op
BenchmarkImpl7Generic/100000-16      	     644	   1865776 ns/op
BenchmarkCookiejar/0-16              	  135667	      8730 ns/op
BenchmarkCookiejar/1-16              	  136633	      8764 ns/op
BenchmarkCookiejar/10-16             	  135256	      8783 ns/op
BenchmarkCookiejar/100-16            	  125581	      9418 ns/op
BenchmarkCookiejar/1000-16           	   49455	     24119 ns/op
BenchmarkCookiejar/10000-16          	    5970	    193347 ns/op
BenchmarkCookiejar/100000-16         	     519	   2263696 ns/op
BenchmarkBcmills/0-16                	527711010	         2.181 ns/op
BenchmarkBcmills/1-16                	30507412	        37.13 ns/op
BenchmarkBcmills/10-16               	 1707492	       706.4 ns/op
BenchmarkBcmills/100-16              	  475962	      2306 ns/op
BenchmarkBcmills/1000-16             	   33268	     35265 ns/op
BenchmarkBcmills/10000-16            	    3345	    331535 ns/op
BenchmarkBcmills/100000-16           	      84	  11991809 ns/op
PASS
ok  	command-line-arguments	146.154s

@christianrpetrin
Copy link
Author

Completely agree, @sfllaw! Any first Deque version that makes into the STL should support generics, so I added support for generics in Deque v2.0.0. Check it out! https://github.com/ef-ds/deque

I appreciate the kind words!

@christianrpetrin
Copy link
Author

christianrpetrin commented Mar 4, 2023

@changkun , @ianlancetaylor , regarding your comments here and here.

@changkun, yes, completely agree we should remove the hold and move forward with the proposal.

@ianlancetaylor, yes, a double ended queue, that is a major improvement over the initially porposed FIFO queue, was implemented and released here in 2018. Currently, the released Deque package has over 200 open source repositories importing it in over 400 packages, including some large open source projects such as go-micro, ziti, flow, etc! Unfortunately, GitHub doesn't give me info about private repos importing it, but considering the number of git clones, which sits at almost 150 clones every single day, suggests Deque is likely being used by potentially thousands of private projects as well. Check out below some of the Deque's usage metrics.

image
image

I believe the proposal should be approved due to below main points:

  • The many great suggestions the OSS community posted in this proposal (as well as via email exchanges in golang-nuts) has been addressed. A few of the great addressed suggestions are:
    • Added additional tests: here, here (thanks @bcmills, @egonelbre)
    • Make it a double ended queue instead of simple FIFO queue: here (thanks @phf, others)
    • Deque has been released outside of the std library (request) and I believe, given its current usage, it has proven to be useful (thanks @rsc)
  • Deque implementation has been validated and I believe the current code has no bugs/issues as no new issues has been filed since 27/04/2020 (almost 3 years now). Last one was this one
  • Deque's performance is unrivaled, even when tested against the best designed and implemented available open source queues. Not only it performs incredibly well in all test cases, but still manages to do so with an unmatched low memory footprint. Refer here for details
  • Generics is out! Deque fully supports generics now and it performs even better than before. Refer here for details (btw, great job implementing generics, Go team, especially @ianlancetaylor !)

@christianrpetrin
Copy link
Author

christianrpetrin commented Mar 4, 2023

Generics are out. Should we remove the "Hold"?

This issue discusses too many implementation details. If I understand the proposal correctly, with the current generics design, we are likely to propose a set of APIs as follows?

package queue

// Queue implements a FIFO queue.
// The zero value of Queue is an empty queue ready to use.
type Queue[T any] struct

func (q *Queue[T]) Push(v T)
func (q *Queue[T]) Pop() T
func (q *Queue[T]) Len() int
func (q *Queue[T]) Front() T
func (q *Queue[T]) Tail() T
func (q *Queue[T]) Clear()

@changkun, regarding the API, Deque's current API is as follow.

Should the proposal be approved, I propose we go with this design. However, I'm completely open to review and make changes to either the API design and pretty much anywhere in the source code and documentation should the community agree with the changes.

type Deque[T any] struct

func New[T any]() *Deque[T]
func (d *Deque[T]) Init() *Deque[T] // Init initializes or clears deque d.
func (d *Deque[T]) Len() int
func (d *Deque[T]) Front() (T, bool)
func (d *Deque[T]) Back() (T, bool)
func (d *Deque[T]) PushFront(v T)
func (d *Deque[T]) PushBack(v T)
func (d *Deque[T]) PopFront() (T, bool)
func (d *Deque[T]) PopBack() (T, bool)

Note: The second, bool result in Front, Back, PopFront and PopBack indicates whether a valid value was returned; if the deque is empty, false will be returned. This follows the same idea used by map (i.e. "value, ok := myMap["key"]")

@christianrpetrin
Copy link
Author

christianrpetrin commented Mar 4, 2023

@alphadose, thanks for bringing light to the GoSTL's Deque. I agree GoSTL is a good deque. I personally really like its segmented design, however, I added GoSTL's deque to the Deque's benchmark tests and it doesn't perform very well against Deque. As an example, check out below the results of the Microservice test.

benchstat testdata/BenchmarkMicroserviceDequeQueue.txt testdata/BenchmarkMicroserviceGostlQueue.txt
name        old time/op    new time/op    delta
/0-4          45.9ns ± 0%    87.9ns ± 2%   +91.81%  (p=0.000 n=8+9)
/1-4           431ns ± 1%    1243ns ± 1%  +188.38%  (p=0.000 n=9+9)
/10-4         2.70µs ± 1%    7.55µs ± 1%  +179.42%  (p=0.000 n=9+10)
/100-4        23.8µs ± 2%    68.3µs ± 1%  +187.27%  (p=0.000 n=10+10)
/1000-4        235µs ± 4%     685µs ± 1%  +191.26%  (p=0.000 n=10+9)
/10000-4      2.29ms ± 3%    6.86ms ± 2%  +199.21%  (p=0.000 n=10+9)
/100000-4     26.0ms ± 1%    71.3ms ± 1%  +174.80%  (p=0.000 n=10+10)
/1000000-4     275ms ± 2%     724ms ± 2%  +163.62%  (p=0.000 n=10+10)

name        old alloc/op   new alloc/op   delta
/0-4           64.0B ± 0%     88.0B ± 0%   +37.50%  (p=0.000 n=10+10)
/1-4            384B ± 0%     1352B ± 0%  +252.08%  (p=0.000 n=10+10)
/10-4         1.90kB ± 0%    2.60kB ± 0%   +36.55%  (p=0.000 n=10+10)
/100-4        16.2kB ± 0%    14.8kB ± 0%    -8.26%  (p=0.000 n=10+10)
/1000-4        123kB ± 0%     156kB ± 0%   +26.23%  (p=0.000 n=10+10)
/10000-4      1.28MB ± 0%    1.54MB ± 0%   +20.67%  (p=0.000 n=10+10)
/100000-4     12.8MB ± 0%    15.4MB ± 0%   +19.80%  (p=0.000 n=10+10)
/1000000-4     128MB ± 0%     154MB ± 0%   +19.71%  (p=0.000 n=9+10)

name        old allocs/op  new allocs/op  delta
/0-4            1.00 ± 0%      2.00 ± 0%  +100.00%  (p=0.000 n=10+10)
/1-4            11.0 ± 0%      18.0 ± 0%   +63.64%  (p=0.000 n=10+10)
/10-4           75.0 ± 0%     101.0 ± 0%   +34.67%  (p=0.000 n=10+10)
/100-4           709 ± 0%       911 ± 0%   +28.49%  (p=0.000 n=10+10)
/1000-4        7.01k ± 0%     9.07k ± 0%   +29.32%  (p=0.000 n=10+10)
/10000-4       70.2k ± 0%     90.4k ± 0%   +28.86%  (p=0.000 n=10+10)
/100000-4       702k ± 0%      903k ± 0%   +28.75%  (p=0.000 n=10+10)
/1000000-4     7.02M ± 0%     9.03M ± 0%   +28.73%  (p=0.000 n=10+10)

benchstat testdata/BenchmarkMicroserviceDequeStack.txt testdata/BenchmarkMicroserviceGostlStack.txt
name        old time/op    new time/op    delta
/0-4          45.9ns ± 1%    87.1ns ± 1%   +89.85%  (p=0.000 n=9+8)
/1-4           360ns ± 1%    1307ns ± 3%  +263.31%  (p=0.000 n=9+9)
/10-4         2.53µs ± 0%    8.37µs ± 1%  +231.01%  (p=0.000 n=8+10)
/100-4        23.2µs ± 2%    76.4µs ± 0%  +228.90%  (p=0.000 n=10+8)
/1000-4        229µs ± 1%     770µs ± 0%  +236.41%  (p=0.000 n=9+10)
/10000-4      2.38ms ± 1%    7.72ms ± 1%  +224.42%  (p=0.000 n=9+10)
/100000-4     26.2ms ± 4%    80.2ms ± 1%  +206.81%  (p=0.000 n=10+10)
/1000000-4     274ms ± 2%     809ms ± 1%  +195.78%  (p=0.000 n=10+10)

name        old alloc/op   new alloc/op   delta
/0-4           64.0B ± 0%     88.0B ± 0%   +37.50%  (p=0.000 n=10+10)
/1-4            256B ± 0%     1352B ± 0%  +428.12%  (p=0.000 n=10+10)
/10-4         1.39kB ± 0%    2.60kB ± 0%   +86.78%  (p=0.000 n=10+10)
/100-4        14.1kB ± 0%    14.8kB ± 0%    +5.40%  (p=0.000 n=10+10)
/1000-4        121kB ± 0%     157kB ± 0%   +29.37%  (p=0.000 n=10+10)
/10000-4      1.27MB ± 0%    1.54MB ± 0%   +20.99%  (p=0.000 n=10+10)
/100000-4     12.8MB ± 0%    15.4MB ± 0%   +19.81%  (p=0.000 n=10+10)
/1000000-4     128MB ± 0%     154MB ± 0%   +19.71%  (p=0.000 n=10+9)

name        old allocs/op  new allocs/op  delta
/0-4            1.00 ± 0%      2.00 ± 0%  +100.00%  (p=0.000 n=10+10)
/1-4            10.0 ± 0%      18.0 ± 0%   +80.00%  (p=0.000 n=10+10)
/10-4           74.0 ± 0%     101.0 ± 0%   +36.49%  (p=0.000 n=10+10)
/100-4           707 ± 0%       911 ± 0%   +28.85%  (p=0.000 n=10+10)
/1000-4        7.01k ± 0%     9.08k ± 0%   +29.46%  (p=0.000 n=10+10)
/10000-4       70.2k ± 0%     90.4k ± 0%   +28.87%  (p=0.000 n=10+10)
/100000-4       702k ± 0%      903k ± 0%   +28.75%  (p=0.000 n=10+10)
/1000000-4     7.02M ± 0%     9.03M ± 0%   +28.73%  (p=0.000 n=10+9)

Similar results are observed across all/most tests. Refer here for details.

@christianrpetrin
Copy link
Author

I added a generic implementation of Queueimpl7 (fork), here is a PR containing the changes I did, the benchmark results suggest that the generic version is even faster.

Benchmark results

@lovromazgon, much appreciated for the contribution to impl7! However, Deque's (which is like impl7's v2) supports generics now and we're proposing to add Deque to STL (instead of impl7)

@ianlancetaylor
Copy link
Member

Note that the next hurdle here for adding container types to the standard library is some plan for iterating over a generic container. We aren't going to add any generic containers to the standard library until that is settled, so that we aren't locked into to an unusual iteration approach.

@christianrpetrin
Copy link
Author

Sounds good, @ianlancetaylor . What's the current state of the conversation? Is there a proposal for it already?

@kscooo
Copy link

kscooo commented Mar 6, 2023

@christianrpetrin Yes, related discussions are in #54245 and #56413.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests