Download Scala multisets from Maven Central: https://github.com/nicolasstucki/multisets/blob/master/MavenRepository.md
The Bag
trait represents multisets. A bag is a kind of Iterable
that groups elements together. The goal is to take advantage of these groupings to compact the space required to keep elements in the data structure and/or make some methods execute faster.
Groupings are represented by the BagBucket
trait. A bucket is an iterable that contains only equivalent elements. It additionally exposes the multiplicities of elements inside it and a sentinel (or representative element) that is equivalent to all elements of that bucket. All implementations of Bag
are collections of bucket where each bucket represents a different equivalency of elements.
The operations on bags fall into the following categories:
- Lookup:
multiplicity
,multiplicities
. Themultiplicity
method returns the multiplicity of a given element. Themultiplicities
method returns aMap
from elements to their multiplicity (0 if not contained). - Tests:
contains
,subsetOf
. Thecontains
method asks if some element is contained in the multiset. ThesubsetOf
is the musliset subset operation in relation to the other bag (A
subsetOfB
is equivalent toA.forall(e=>A.multiplicity(e)<=B.multiplicity(e))
). - Equivalences:
apply
,mostCommon
,leastCommon
. Theapply
receives an element and return a bag with all equivalent elements. ThemostCommon
/leastCommon
method returns a bag with all the elements for which the number of equivalent elements is maximized/minimized. ThemostCommon
/leastCommon
method may return a bag with several groupings. - Additions:
+
,added
. The+
method adds the element to the bag. Theadded
method adds some number of times the same element. - Removals:
-
,removed
,removedAll
. The - method removes one time element from the bag. The removed method removes some number of times the same element. TheremovedAll
removes all instances of some element. - Updates:
withMultiplicity
. The methodwithMultiplicity
returns a bag with the multiplicity of some element set to some specified value. - Multiset operations:
union
,diff
,intersect
,distinct
,maxUnion
(generalized set union). Theunion
/diff
method yields the same result as the++
/--
method. The intersect yields the multiset intersection. This implies that the multiplicity of every element in the intersection is equal to the minimum multiplicity of that element in the original bags. Thedistinct
method yields aBag
with all the distinct elements exactly once. ThemaxUnion
method yields a bag resulting from the generalized set union. This union is similar to the multiset intersection but with maximum multiplicities. - Iterators:
iterator
,distinctIterator
,bucketsIterator
. The iterator is the normal iterator from Iterable. ThedistinctIterator
returns an iterator of distinct elements, where every element is not equal to the others. The bucketsIterator returns an iterator over buckets, where elements in different bucket are not equivalent.
When a bag is mutable, it also offers side-effecting update methods:
- Additions:
+=
,add
. The+=
methods adds an element to the bag. The add method adds a number of times some element. - Removals:
-=
,remove
,removeAll
. The-=
method removes one instance of an element. The remove method removes a number of times some element. TheremoveAll
removes all instances of some element. - Updates:
setMultiplicity
. ThesetMultiplicity
method sets the multiplicity of some element to match some specified value.
This section will show how to use Bag
s with the default equivalence (i.e. natural equality ==
). The default Bag
is the immutable.HashBag
. This bag uses hashing to access to the BagBucket
buckets in constant time. There is also an implementation based on red-black trees called TreeBag
. This implementation has the added property of keeping buckets in order and has logarithmic bucket access time. For now the order of elements will be the their natural order.
// immutable Bag with a hash table containing the bag buckets
import scala.collection.Bag
import scala.collection.immutable.{HashBag=>Bag}
// or immutable Bag with a red-black tree containing the buckets
import scala.collection.immutable.{TreeBag=>Bag}
// or mutable Bag with a hash table containing the bag buckets
import scala.collection.mutable.{HashBag=>Bag}
// or mutable Bag with a red-black tree containing the buckets
import scala.collection.mutable.{TreeBag=>Bag}
It is also necessary to have an implicit BagConfiguration
. The configurations can be accessed directly with the companion object of the Bag
. It is possible to choose keeping a count of the elements with compact[T]
or keeping references to all the elements in the collection using keepAll[T]
. Note that the configuration is spawned from the implementation itself and therefore is specific to it. This implies that the configuration of one implementation may not be compatible with another implementation.
// Configuration of that keeps one reference to the element and it’s multiplicity
implicit val config = Bag.configuration.compact[String]
// or a configuration of that keeps all references to the elements
implicit val config = Bag.configuration.keepAll[String]
Now it is possible to create instances of Bag
. For this example the bag will contain one cat, three dogs and two mice. The following are equivalent statements:
// use the constructor form multiplicities of elements
val bag = Bag.from("cat"->1, "dog"->3, "mouse"->2)
// or enumerate all elements
val bag = Bag("cat", "dog", "dog", "dog", "mouse")
// or adding elements one by one
val bag = Bag.empty[String] + "cat" + "dog" + "dog" + "dog" + "mouse"
// or adding several times the same element
val bag = Bag.empty[String] added ("cat"->1) added ("dog"->3) added ("mouse"->1)
// or by setting the multiplicity of an element
val bag = Bag.empty[String] withMultiplicity ("cat", 1) withMultiplicity ("dog", 3) withMultiplicity ("mouse",1)
bag.toString >>> Bag(dog, dog, dog; cat; mouse)
Note that a semi-colon separates the buckets and commas separate the elements inside it. This semi-colon does not work in constructors.
Here are some simple examples of how the Bag operations are used and their results:
// The apply method returns a bag with all equivalent elements,
// this is more useful when using a custom equivalence (refer to ‘How to use Bag with custom equivalences’)
bag("dog") >>> Bag(dog, dog, dog)
bag.multiplicity("cat") >>> 1
bag.multiplicity("dog") >>> 3
bag.multiplicity("fish") >>> 0
bag.maxMultiplicity >>> 3 // the multiplicity of "dog"
bag.minMultiplicity >>> 1 // the multiplicity of "cat" or "mouse"
bag.mostCommon >>> Bag(dog, dog, dog)
bag.leastCommon >>> Bag(cat; mouse)
bag union Bag("cat", "fish") >>> Bag(dog, dog, dog; cat, cat; fish; mouse)
bag intersect Bag("cat", "dog", "dog", "fish") >>> Bag(dog, dog; cat)
bag diff Bag("cat", "dog", "fish") >>> Bag(dog, dog; mouse)
bag maxUnion Bag("cat", "dog", "dog", "fish") >>> Bag(dog, dog; cat; fish; mouse)
bag.withMultiplicity("dog", 1) >>> Bag(dog; cat; mouse)
Additionally for mutable bags there are the operations that change the state of that instance of the bag:
bag += "cat" >>> Bag(dog, dog, dog; cat, cat; mouse)
bag -= "dog" >>> Bag(dog, dog; cat, cat; mouse)
bag += ("mouse"-> 2) >>> Bag(dog, dog; cat, cat; mouse, mouse, mouse)
bag -= ("dog"-> 2) >>> Bag(cat, cat; mouse, mouse, mouse)
bag.add("fish"->2) >>> Bag(cat, cat; mouse, mouse, mouse; fish, fish)
bag.remove("mouse"->2) >>> Bag(cat, cat; mouse; fish, fish)
bag.setMultiplicity("cat"->3) >>> Bag(cat, cat, cat; mouse; fish, fish)
All operations of Iterable are also available, for example:
bag map (_.size) >>> Bag(5; 3, 3, 3, 3)
bag filter (_.size>3) >>> Bag("mouse")
for (animal <- bag) { ... }
...
This example shows how to use a Bag
to build a histogram. Specifically a histogram of the Math.random()
function called a number of times.
import scala.collection.mutable.HashBag
// Configuration for the bag using multiplicities of Int elements on default equivalence
implicit val bagConfiguration = HashBag.configuration.ofMultiplicities[Int]
// Create empty histogram
val histogram = HashBag.empty[Int]
val LIMIT = 50
// Add random numbers to the histogram
for (_ <- 1 to LIMIT * LIMIT) {
histogram += (Math.random() * LIMIT).toInt
}
// Print the contents of the histogram
println(
s"""
|Histogram of (Math.random() * LIMIT).toInt over ${LIMIT * LIMIT} executions
|=================================================================
| maximum occurrences: ${histogram.maxMultiplicity}
| minimum occurrences: ${histogram.minMultiplicity}
|=================================================================""".stripMargin)
for (i <- 0 until LIMIT) {
val occurrences = histogram.multiplicity(i)
println(s"${i.toString.padTo(4, ' ')}: ${"".padTo(occurrences, '|')} $occurrences")
}
println("=================================================================")
Any Bag
implementation is able to handle different bucket implementations and different equivalence groupings.
Custom equivalences are defined using instances that implement the method equiv(x,y)
of the Equiv[T]
trait. Equivalence groupings represent equivalence relations, thus the relation must satisfy:
- reflexive: ``equiv(x,x) == true
for any
x` - symmetric:
equiv(x,y) == equiv(y,x)
for anyx
and anyy
- transitive: if
equiv(x,y) == true && equiv(y,z) == true
thenequiv(x,z) == true
For both bag implementations (HashBag
and TreeBag
) it is necessary to add other traits. The HashBag requires a hashing function (hashed equivalence) and the TreeBag requires an ordering (ordered equivalence). It is possible to have Bags that only depend on equivalence but none has been implemented.
There are three basic of representations of bag buckets: MultiplicityBucket
, BagOfMultiplicities
and ListBucket
. The first two are compact representations and the last is a full representation. The full representation keeps all the references to the objects inside. The compact representation only keeps one reference to one of the objects and the multiplicity of that object in the bag. The compact representation provides faster implementations of methods involving querying and transformation.
MultiplicityBucket
is an implementation of buckets that retain one reference to one of the elements that were added to the bag and it’s multiplicity. This is the most compact representation available for bags and it works only on bags with grouping on equality. For example a the bag Bag(1, 1, 1; 2; 3, 3)
would have three MultiplicityBuckets
: one that contains the integer 1 as sentinel and the integer 3 as multiplicity MultiplicityBucket(1->3)
, one with MultiplicityBucket(2->1)
and one with MultiplicityBucket(3->2)
.
BagOfMultiplicities
is an implementation of buckets that allows custom equivalences while being a compact representation. It is actually a bag that contains only equivalent elements. The bag has a compact representation on equality using MultiplicityBucket
internally. For example the bag Bag("A","A","a";"B","b";"c")
where upper and lower cases are defined as equivalent, the bag would contain one BagOfMultiplicities
for {"A"
,"A"
,"a"
}, one for {’B’,’b’} and one for {"c"
}. And, the one that contains {"A"
,"A"
,"a"
} will be a bag with the buckets MultiplicityBucket("A"->2)
and MultiplicityBucket("a"->1)
.
ListBucket
is an implementation of buckets that keeps all the references of the elements of the bag in a List. Hence it is a full representation where there is one list for each equivalency grouping. For example the bag Bag("A", "A", "a", "B", "b"; "c")
will contain the three buckets where each contains one of the following lists: List("A","A","a")
, List("B")
or List("c")
.
Bag configurations are instances of some implementation of the BagConfiguration trait. Each bag companion object exposes three methods that create bag configurations for it. Those methods are compact
, compactWithEquiv
and keepAll
. Their signature varies depending on the kind of equivalence that is expected (hashed or sorted).
Bag.configuration.compact
: configuration for bags withMultiplicityBucket
.Bag.configuration.compactWithEquiv
: configuration for bags withBagOfMultiplicities
.Bag.configuration.keepAll
: configuration for bags withListBucket
.
Usually hash table use the hash code and the equality function of an object to place and find it. In Bags object are placed using an equivalence that does not necessarily correspond to equality. In such a case the default hash function will almost certainly not be coherent with the equivalence. Therefore to define the custom equivalence it is also necessary to define it along a coherent hashing function. A coherent hashing function is one that will have the same value for two elements that are equivalent (if equiv(x,y) == true
then hash(x)==hash(y)
). To pack the two properties (or functions) together these equivalences are represented with an instance of Equiv[T] with Hashing[T]
.
For example the case insensitive equivalence on stings, where "Cat"
, "cat"
and "CaT"
are equivalent, can be defined with:
val stringEquiv = new Equiv[String] with Hashing[String] {
def equiv(x: String, y: String): Boolean = x.toLowerCase == y.toLowerCase
def hash(x: String): Int = x.toLowerCase.hashCode
}
A simple way to make sure that the hashing is coherent with the equivalence is to have a representative element function rep(e)
, such that all elements that are equivalent are mapped to the same element. Then the equiv(x,y)
function simply returns rep(x)==rep(y)
and the hash(x)
returns rep(x).hashCode
. In the example above the string representation function is the toLowerCase method.
To create configurations for bags with elements of type T with hashed equivalences the signatures are as follows:
compact[T]
: Creates a configuration with equivalence defined as equality. The hashing used is the object's hash code. The buckets will be instances ofMultiplicityBucket
.compactWithEquiv[T](equivClass: Equiv[T] with Hashing[T])
: Creates a configuration with a custom equivalenceequivClass
. The buckets will be instances ofBagOfMultiplicities
.keepAll[T]
: Creates a configuration with equivalence defined as equality. The hashing used is the object's hash code. The buckets will be instances ofListBucket
.keepAll[T](equivClass: Equiv[T] with Hashing[T])
: Creates a configuration with a custom equivalenceequivClass
. The buckets will be instances ofListBucket
.
When using ordered trees, such as red-black trees, it is necessary to have an ordering on the elements. Therefore it is possible to take advantage of this existing requirement to define the equivalence at the same time by using only one Ordering
instance. Note that Ordering[T]
implements Equiv[T]
with equiv(x,y) = compare(x,y) == 0
.
For example the equivalence over strings where two strings are equivalent if and only if their size is the same:
val stringEquiv = new Ordering[String] {
def compare(x: String, y: String): Int = x.size compare y.size
}
To implement this function correctly for some equivalence just make sure the for equivalent elements the compare function is reflexive, symmetric, and transitive.
To create configurations for bags with elements of type T
with ordered equivalences the signatures are as follows:
compact[A](implicit equivClass: Ordering[T])
: Creates a configuration with a equivalence/ordering defined on equality. If it is used without the parameters it will try to find an implicitOrdering[T]
of for that class, this ordering must define the equivalence as equality. The buckets will be instances ofMultiplicityBucket
.compactWithEquiv[A](equivClass: Ordering[T])(implicit innerOrdering: Ordering[T])
: Creates a configuration with a custom equivalence/ordering equivClass. The buckets will be instances ofBagOfMultiplicities
. It receives an implicitOrdering[T]
that is used as ordering of the buckets, this ordering must define the equivalence as equality.keepAll[A](implicit equivClass: Ordering[T])
: Creates a configuration with a custom equivalence/orderingequivClass
. The buckets will be instances ofListBucket
.
Custom equivalences on bags are exposed on the apply
, mostCommon
, leastCommon
, bucketsIterator
and toString
methods. The other methods yield the same results with any equivalence defined on it, but the execution times may vary with different configurations.
Hashed equivalences work on mutable or immutable HashBags
. The example below shows how to configure bags with custom equivalence. The case insensitive equivalence on stings, where "Cat"
, "cat"
and "CaT"
are equivalent is used.
// immutable Bag with a hash table containing the bag buckets
import scala.collection.Bag
import scala.collection.immutable.{HashBag=>Bag}
// or mutable Bag with a hash table containing the bag buckets
import scala.collection.mutable.{HashBag=>Bag}
import scala.util.hashing.Hashing
object StringEquiv extends Equiv[String] with Hashing[String] {
def equiv(x: String, y: String): Boolean = x .toLowerCase == y.toLowerCase
def hash(x: String): Int = x.toLowerCase.hashCode
}
// Configuration of that keeps one reference to the element and it’s multiplicity and
// puts elements that are not equivalent with respect of StringEquiv in different subbags.
implicit val config = Bag.configuration.compactWithEquiv(StringEquiv)
// or a configuration of that keeps all references to the elements into lists.
// Where each list contains only equivalent elements with respect to StringEquiv.
implicit val config = Bag.configuration.keepAll(StringEquiv)
val bag = Bag("cat", "dog", "Mouse", "Cat", "Dog", "Cat" )
bag.toString() >>> Bag(cat, Cat, Cat; dog, Dog, Mouse)
bag("CAT") >>> Bag("cat", "Cat", "Cat")
bag("DoG") >>> Bag("dog", "Dog")
bag("mouse") >>> Bag("Mouse")
bag("fish") >>> Bag()
bag.mostCommon >>> Bag("cat", "Cat", "Cat")
bag.leastCommon >>> Bag("Mouse")
// List of all one representative element of each bucket
bag.bucketsIterator.map(_.sentinel).toList >>> List("cat", "Dog", "Mouse")
// Other methods that don’t take into account the equivalences still behave the same way
// For example the multiplicity method:
bag.multiplicity("cat") >>> 1
bag.multiplicity("Cat") >>> 2
bag.multiplicity("CAT") >>> 0
Sorted equivalences work on mutable or immutable TreeBags
. The example below shows how to configure bags with custom equivalence. The equivalence over strings where two strings are equivalent if and only if their size is the same is used.
// immutable Bag with a red-black trees containing the bag buckets
import scala.collection.Bag
import scala.collection.immutable.{TreeBag=>Bag}
// or mutable Bag with a red-black trees containing the bag buckets
import scala.collection.mutable.{TreeBag=>Bag}
object StringOrdering extends Ordering[String] {
def compare(x: String, y: String): Int = x.size compare y.size
}
// Configuration of that keeps one reference to the element and it’s multiplicity
// and puts elements that are not equivalent with respect of StringOrdering in different sub-bags.
// Buckets are in the order defined in StringOrdering
implicit val config = Bag.configuration.compactWithEquiv(StringOrdering)
// or a configuration of that keeps all references to the elements into lists.
// Where each list contains only equivalent elements with respect to StringOrdering.
// Buckets are in the order defined in StringOrdering.
implicit val config = Bag.configuration.keepAll(StringOrdering).
val bag = Bag("cat", "mouse","fish", "elephant", "dog", "bear" )
bag.toString() >>> Bag(cat, dog; bear, fish; mouse; elephant)
bag("cat") >>> Bag("cat", "dog")
bag("ooo") >>> Bag("cat", "dog")
bag("xxxx") >>> Bag("fish", "bear")
bag.mostCommon >>> Bag("cat", "dog", "fish", "bear")
bag.leastCommon >>> Bag( "mouse", "elephant")
// List with the sizes of the strings in the bag
bag.bucketsIterator.map(_.sentinel.size).toList >>> List(3, 4, 5, 8)
Complete examples of simple applications that use bags are available:
- The complete histogram example at https://github.com/nicolasstucki/multisets-examples/tree/master/src/main/scala/examples/histogram.
- Implementation of a generic genetic algorithm using bags. With to simple applications that use it: simple bit accumulation and Traveling Salesman. At https://github.com/nicolasstucki/multisets-examples/tree/master/src/main/scala/examples/genetic_algorithm.
- Directory analyser: uses different equivalences to classify keep track and classify files. At https://github.com/nicolasstucki/multisets-examples/tree/master/src/main/scala/examples/directory.