A MongoDB White Paper
September 2013
MongoDB Operations Best Practices
MongoDB 2.4
Contents
INTRODUCTION
Roles and Responsibilities
Data Architect
Database Administrator (DBA)
System Administrator (sysadmin)
Application Developer
Network Administrator
I. PREPARING FOR A MONGODB DEPLOYMENT
Schema Design
Document Size
Data Lifecycle Management
Indexing
Working Sets
MongoDB Setup and Configuration
Data Migration
Hardware
Operating System and File System Configurations for Linux
Networking
Production Recommendations
II. HIGH AVAILABILITY
Journaling
Data Redundancy
Availability of Writes
Read Preferences
III. SCALING A MONGODB SYSTEM
Horizontal Scaling with Shards
Selecting a Shard Key
Sharding Best Practices
Dynamic Data Balancing
Sharding and Replica Sets
Geographic Distribution
IV. DISASTER RECOVERY
Multi-Data Center Replication
Backup and Restore
V. CAPACITY PLANNING
Monitoring Tools
Things to Monitor
VI. SECURITY
Defense in Depth
Access Control
Kerberos Authentication
Identity Management
SSL
Data Encryption
Query Injection
CONCLUSION
ABOUT MONGODB
RESOURCES
MongoDB is the open-source, document database
that is popular among both developers and
operations professionals given its agile and
scalable approach. MongoDB is used in hundreds of
production deployments by organizations ranging in
size from emerging startups to the largest Fortune
50 companies. This paper provides guidance on best
practices for deploying and managing a MongoDB
cluster. It assumes familiarity with the architecture
of MongoDB and a basic understanding of concepts
related to the deployment of enterprise software.
Fundamentally MongoDB is a database and the
concepts of the system, its operations, policies, and
procedures should be familiar to users who have
deployed and operated other database systems.
While some aspects of MongoDB are different
from traditional relational database systems, skills
and infrastructure developed for other database
systems are relevant to MongoDB and will help to
make deployments successful. Typically MongoDB
users find that existing database administrators,
system administrators, and network administrators
need minimal training to understand MongoDB.
The concepts of a database, tuning, performance
monitoring, data modeling, index optimization and
other topics are very relevant to MongoDB. Because
MongoDB is designed to be simple to administer
and to deploy in large clustered environments, most
users of MongoDB find that with minimal training
an existing operations professional can become
competent with MongoDB, and that MongoDB
expertise can be gained in a relatively short
period of time.
This document discusses many best practices for
operating and deploying a MongoDB system. The
MongoDB community is vibrant and new techniques
and lessons are shared every day.
This document is subject to change. For the most
up to date version of this document please visit
mongodb.com. For the most current and detailed
information on specific topics, please see the online
documentation at mongodb.org. Many links are
provided throughout this document to help guide
users to the appropriate resources online.
ROLES AND RESPONSIBILITIES
Applications deployed on MongoDB require careful
planning and the coordination of a number of roles in
an organization’s technical teams to ensure successful
maintenance and operation. Organizations tend to
find many of the same individuals and their respective
roles for traditional technology deployments are
appropriate for a MongoDB deployment: Data
Architects, Database Administrators, System
Administrators, Application Developers, and Network
Administrators.
In smaller organizations it is not uncommon to
find these roles are provided by a small number of
individuals, each potentially fulfilling multiple roles,
whereas in larger companies it is more common
for each role to be provided by an individual or
team dedicated to those tasks. For example, in a
large investment bank there may be a very strong
delineation between the functional responsibilities
of a DBA and those of a system administrator.
DATA ARCHITECT
While modeling data for MongoDB is typically
simpler than modeling data for a relational database,
there tend to be multiple options for a data model,
and tradeoffs with each alternative regarding
performance, resource utilization, ease of use, and
other areas. The data architect can carefully weigh
these options with the development team to make
informed decisions regarding the design of the
schema. Typically the data architect performs tasks
that are more proactive in nature, whereas the
database administrator may perform tasks that are
more reactive.
DATABASE ADMINISTRATOR (DBA)
As with other database systems, many factors should
be considered in designing a MongoDB system for
a desired performance SLA. The DBA should be
involved early in the project regarding discussions
of the data model, the types of queries that will
be issued to the system, the query volume, the
availability goals, the recovery goals, and the
desired performance characteristics.
SYSTEM ADMINISTRATOR (SYSADMIN)
Sysadmins typically perform a set of activities similar
to those required in managing other applications,
including upgrading software and hardware,
managing storage, system monitoring, and data
migration. MongoDB users have reported that their
sysadmins have had no trouble learning to deploy,
manage and monitor MongoDB because no special
skills are required.
APPLICATION DEVELOPER
The application developer works with other members
of the project team to ensure the requirements
regarding functionality, deployment, security, and
availability are clearly understood. The application
itself is written in a language such as Java, C#, PHP
or Ruby. Data will be stored, updated, and queried in
MongoDB, and language-specific drivers are used to
communicate between MongoDB and the application.
The application developer works with the data
architect to define and evolve the data model and to
define the query patterns that should be optimized.
The application developer works with the database
administrator, sysadmin and network administrator to
define the deployment and availability requirements
of the application.
NETWORK ADMINISTRATOR
A MongoDB deployment typically involves multiple
servers distributed across multiple data centers.
Network resources are a critical component of a
MongoDB system. While MongoDB does not require
any unusual configurations or resources as compared
to other database systems, the network administrator
should be consulted to ensure the appropriate
policies, procedures, configurations, capacity, and
security settings are implemented for the project.
I. Preparing for a
MongoDB Deployment
SCHEMA DESIGN
Developers and data architects should work together
to develop the right data model, and they should
invest time in this exercise early in the project. The
application should drive the data model, updates, and
queries of your MongoDB system. Given MongoDB’s
dynamic schema, developers and data architects can
continue to iterate on the data model throughout the
development and deployment processes to optimize
performance and storage efficiency.
The topic of schema design is significant, and a full
discussion is beyond the scope of this document. A
number of resources are available online, including
conference presentations from MongoDB Solutions
Architects and users, as well as no-cost, web-based
training provided by MongoDB University. Briefly, some
concepts to keep in mind:
Document Model
MongoDB stores data as documents in a binary
representation called BSON. The BSON encoding
extends the popular JSON representation to include
additional types such as int, long, and floating point.
BSON documents contain one or more fields, and
each field contains a value of a specific data type,
including arrays, sub-documents and binary data.
It may be helpful to think of documents as roughly
equivalent to rows in a relational database, and fields
as roughly equivalent to columns. However, MongoDB
documents tend to have all data for a given record in
a single document, whereas in a relational database
information for a given record is usually spread across
rows in many tables. In other words, data in MongoDB
tends to be more localized.
Dynamic Schema
MongoDB documents can vary in structure. For
example, documents that describe users might all
contain the user_id and the last_date they logged into
the system, but only some of these documents might
contain the user’s shipping address, and perhaps
some of those contain multiple shipping addresses.
MongoDB does not require that all documents
conform to the same structure. Furthermore, there is
no need to declare the structure of documents to the
system – documents are self-describing.
Collections
Collections are groupings of documents. Typically
all documents in a collection have similar or related
purposes for an application. It may be helpful to
think of collections as being analogous to tables in a
relational database.
Indexes
MongoDB uses B-tree indexes to optimize queries.
Indexes are defined in a collection on document
fields. MongoDB supports many index types,
including compound, geospatial, TTL, text search,
sparse, unique, and others. For more information see
the section on indexes.
Transactions
MongoDB guarantees atomic updates to data at the
document level. It is not possible to update multiple
documents in a single atomic operation. Atomicity
of updates may influence the schema for your
application.
Schema Enforcement
MongoDB does not enforce schemas. Schema
enforcement should be performed by the application.
For more information on schema design, please see
Data Modeling Considerations for MongoDB in the
MongoDB Documentation.
DOCUMENT SIZE
The maximum BSON document size in MongoDB is
16MB. Users should avoid certain application patterns
that would allow documents to grow unbounded. For
instance, applications should not typically update
documents in a way that causes them to grow
significantly after they have been created, as this can
lead to inefficient use of storage. If the document size
exceeds its allocated space, MongoDB will relocate
the document on disk. This automatic process can
be resource intensive and time consuming, and can
unnecessarily slow down other operations in the
database.
For example, in an ecommerce application it would
be difficult to estimate how many reviews each
product might receive from customers. Furthermore,
it is typically the case that only a subset of reviews
is displayed to a user, such as the most popular or
the most recent reviews. Rather than modeling the
product and customer reviews as a single document
it would be better to model each review or groups
of reviews as a separate document with a reference
to the product document. This approach would also
allow the reviews to reference multiple versions of
the product such as different sizes or colors.
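
As a minimal sketch of the referencing pattern described above, the
following Python snippet models a product and its reviews as plain
dictionaries. No MongoDB driver is involved, and the field names
(product_id, rating, created) and the helper recent_reviews are
hypothetical, chosen only for illustration.

```python
# Illustrative only: documents are modeled as plain Python dicts.
product = {"_id": "sku-123", "name": "Trail Shoe", "sizes": [9, 10, 11]}

# Each review is its own document and points back to the product by _id,
# so the product document does not grow as reviews accumulate.
reviews = [
    {"_id": 1, "product_id": "sku-123", "rating": 5, "created": 3},
    {"_id": 2, "product_id": "sku-123", "rating": 4, "created": 7},
    {"_id": 3, "product_id": "sku-999", "rating": 2, "created": 9},
    {"_id": 4, "product_id": "sku-123", "rating": 3, "created": 8},
]


def recent_reviews(all_reviews, product_id, limit):
    """Return the newest `limit` reviews that reference `product_id`."""
    matching = [r for r in all_reviews if r["product_id"] == product_id]
    return sorted(matching, key=lambda r: r["created"], reverse=True)[:limit]
```

Only the subset of reviews that is displayed needs to be fetched, and
the product document keeps a stable size.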
Optimizing for Document Growth
MongoDB adaptively learns if the documents
in a collection tend to grow in size and assigns
a padding factor to provide sufficient space for
document growth. This factor can be viewed
as the paddingFactor field in the output of the
db.<collection-name>.stats() command. For example,
a value of 1 indicates no padding factor, and a value
of 1.5 indicates a padding factor of 50%.
When a document is updated in MongoDB the data
is updated in-place if there is sufficient space. If the
size of the document is greater than the allocated
space, then the document may need to be re-written
in a new location in order to provide sufficient space.
The process of moving documents and updating
their associated indexes can be I/O-intensive and can
unnecessarily impact performance.
Space Allocation Tuning
Users who anticipate updates and document growth
may consider two options with respect to padding.
First, the usePowerOf2Sizes attribute can be set on
a collection. This setting will configure MongoDB
to round up allocation sizes to powers of 2 (e.g.,
2, 4, 8, 16, 32, 64). This setting tends to reduce
the chances of increased disk I/O at the cost of using
some additional storage. The second option is to
manually pad the documents. If the application will
add data to a document in a predictable fashion, the
fields can be created in the document before the
values are known in order to allocate the appropriate
amount of space during document creation. Padding
will minimize the relocation of documents and
thereby minimize over-allocation.
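
The arithmetic behind the padding factor and power-of-2 rounding can
be sketched as follows. This is a simplified model, not MongoDB's
allocator: the function names are hypothetical, and any minimum or
maximum allocation bounds that the server may apply are ignored.

```python
import math


def padded_allocation(document_bytes, padding_factor):
    """Space reserved when a padding factor is applied (1.5 means 50% extra)."""
    return math.ceil(document_bytes * padding_factor)


def power_of_two_allocation(document_bytes):
    """Round a size up to the next power of two (simplified, no bounds)."""
    if document_bytes < 1:
        raise ValueError("document_bytes must be positive")
    return 1 << (document_bytes - 1).bit_length()
```

Under these assumptions, a 1,000-byte document would reserve 1,500
bytes with a padding factor of 1.5, or 1,024 bytes with power-of-2
rounding.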
GridFS
For files larger than 16MB, MongoDB provides a
convention called GridFS, which is implemented by
all MongoDB drivers. GridFS automatically divides
large data into 256KB pieces called “chunks” and
maintains the metadata for all chunks. GridFS allows
for retrieval of individual chunks as well as entire
documents. For example, an application could quickly
jump to a specific timestamp in a video. GridFS is
frequently used to store large binary files such as
images and videos in MongoDB.
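
To illustrate the chunking arithmetic, here is a small sketch that
uses the 256KB chunk size stated above. The helper names are
hypothetical; this models only the arithmetic and is not the GridFS
driver API.

```python
CHUNK_BYTES = 256 * 1024  # chunk size stated in the text above


def chunk_count(file_bytes):
    """Number of chunks needed to hold a file of the given size."""
    return -(-file_bytes // CHUNK_BYTES)  # ceiling division


def chunk_for_offset(byte_offset):
    """Zero-based index of the chunk that contains a given byte offset."""
    return byte_offset // CHUNK_BYTES
```

Seeking to a position in a large file then amounts to computing the
chunk index for that byte offset and retrieving only that chunk.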
DATA LIFECYCLE MANAGEMENT
MongoDB provides features to facilitate the
management of data lifecycles, including Time to
Live, and capped collections.
Time to Live (TTL)
If documents in a collection should only persist for
a pre-defined period of time, the TTL feature can be
used to automatically delete documents of a certain
age rather than scheduling a process to check the
age of all documents and run a series of deletes. For
example, if user sessions should only exist for one
hour, the TTL can be set for 3600 seconds for a date
field called lastActivity that exists in documents used
to track user sessions and their last interaction with
the system. A background thread will automatically
check all these documents and delete those that
have been idle for more than 3600 seconds. Another
example for TTL is a price quote that should
automatically expire after a period of time.
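
The expiry rule in the session example can be sketched in plain
Python. This is only a model of the rule under stated assumptions:
the helpers is_expired and purge_expired are hypothetical, and the
timing of the real background thread is not represented.

```python
from datetime import datetime, timedelta

TTL_SECONDS = 3600  # one hour, as in the session example


def is_expired(last_activity, now, ttl_seconds=TTL_SECONDS):
    """True when the document's date field is older than the TTL."""
    return now - last_activity > timedelta(seconds=ttl_seconds)


def purge_expired(sessions, now):
    """Return only the sessions that a TTL pass would keep."""
    return [s for s in sessions if not is_expired(s["lastActivity"], now)]
```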
Capped Collections
In some cases a rolling window of data should
be maintained in the system based on data size.
Capped collections are fixed-size collections that
support high-throughput inserts and reads based on
insertion order. A capped collection behaves like a
circular buffer: data is inserted into the collection,
that insertion order is preserved, and when the total
size reaches the threshold of the capped collection,
the oldest documents are deleted to make room
for the newest documents. For example, store log
information from a high-volume system in a capped
collection to quickly retrieve the most recent log
entries without designing for storage management.
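
The circular-buffer behavior can be modeled with a short,
self-contained sketch. The CappedBuffer class is hypothetical and
measures size as UTF-8 bytes of each entry; it illustrates the
eviction rule only and is not how MongoDB stores capped collections.

```python
from collections import deque


class CappedBuffer:
    """Size-bounded, insertion-ordered buffer that evicts the oldest entries."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self.entries = deque()

    def insert(self, entry):
        size = len(entry.encode("utf-8"))
        if size > self.max_bytes:
            raise ValueError("entry larger than the buffer")
        # Evict from the oldest end until the new entry fits.
        while self.used_bytes + size > self.max_bytes:
            oldest = self.entries.popleft()
            self.used_bytes -= len(oldest.encode("utf-8"))
        self.entries.append(entry)
        self.used_bytes += size

    def latest(self, n):
        """Most recent n entries, newest first."""
        if n <= 0:
            return []
        return list(self.entries)[-n:][::-1]
```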
Dropping a Collection
It is very efficient to drop a collection in MongoDB. If
your data lifecycle management requires periodically
deleting large volumes of documents, it may be best
to model those documents as a single collection.
Dropping a collection is much more efficient than
removing all documents or a large subset of a collection,
just as dropping a table is more efficient than deleting all
the rows in a table in a relational database.
INDEXING
As in most database management systems, indexes are
a crucial mechanism for optimizing system performance
in MongoDB. And while indexes will improve the
performance of some operations by one or more orders
of magnitude, they have associated costs in the form of
slower updates, disk usage, and memory usage. Users
should always create indexes to support queries, but
should take care not to maintain indexes that the queries
do not use. Each index incurs some cost for every insert
and update operation: if the application does not use
these indexes, maintaining them can adversely affect the overall
capacity of the database. This is particularly important
for deployments that have insert-heavy workloads.
Query Optimization
Queries are automatically optimized by MongoDB to
make evaluation of the query as efficient as possible.
Evaluation normally includes the selection of data based
on predicates, and the sorting of data based on the sort
criteria provided. Generally MongoDB makes use of one
index in resolving a query. The query optimizer selects
the best index to use by periodically running alternate
query plans and selecting the index with the lowest scan
count for each query type. The results of this empirical
test are stored as a cached query plan and periodically
updated.
MongoDB provides an explain plan capability that shows
information about how a query was resolved, including:
• The number of documents returned.
• Which index was used.
• Whether the query was covered, meaning no
documents needed to be read to return results.
• Whether an in-memory sort was performed, which
indicates an index would be beneficial.
• The number of index entries scanned.
• How long the query took to resolve in milliseconds.
The explain plan will show 0 milliseconds if the query
was resolved in less than 1ms, which is not uncommon
in well-tuned systems. When explain plan is called, prior
cached query plans are abandoned, and the process of
testing multiple indexes is evaluated to ensure the best
possible plan is used.
If the application will always use indexes, MongoDB can
be configured to throw an error if a query is issued that
requires scanning the entire collection.
Profiling
MongoDB provides a profiling capability called Database
Profiler, which logs fine-grained information about
database operations. The profiler can be enabled to log
information for all events or only those events whose
duration exceeds a configurable threshold (100ms by
default). Profiling data is stored in a capped collection
where it can easily be searched for interesting events – it
may be easier to query this collection than parsing the
log files.
Primary and Secondary Indexes
A unique index is created for all documents on the _id
field. MongoDB will automatically create the _id field and
assign a unique value, or the value can be specified when
the document is inserted. All user-defined indexes are
secondary indexes. Any field can be used for a secondary
index, including fields with arrays.
Compound Indexes
Generally queries in MongoDB can only be optimized
by one index at a time. It is therefore useful to create
compound indexes for queries that specify multiple
predicates. For example, consider an application that
stores data about customers. The application may need to
find customers based on last name, first name, and state
of residence. With a compound index on last name, first
name, and state of residence, queries could efficiently
locate people with all three of these values specified.
An additional benefit of a compound index is that any
leading field within the index can be used, so fewer
indexes on single fields may be necessary: this compound
index would also optimize queries looking for customers
by last name.
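
The leading-field rule can be expressed as a simplified rule of
thumb. The function usable_prefix and the field names below are
hypothetical, and the real query optimizer considers more than this
sketch does.

```python
def usable_prefix(index_fields, query_fields):
    """Return the leading index fields that the query constrains."""
    prefix = []
    for field in index_fields:
        if field not in query_fields:
            break
        prefix.append(field)
    return prefix


# Compound index from the customer example: last name, first name, state.
customer_index = ["last_name", "first_name", "state"]
```

An empty result means the query does not constrain the leading field,
so under this rule the compound index would not help.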
Unique Indexes
By specifying an index as unique, MongoDB will reject
inserts of new documents or the update of a document
with an existing value for the field for which the unique
index has been created. By default, indexes are not
unique. If a compound index is specified as unique, the
combination of values must be unique. If a document
does not have a value specified for the field then an
index entry with a value of null will be created for the
document. Only one document may have a null value
for the field unless the sparse option is enabled for
the index, in which case index entries are not made
for documents that do not contain the field.
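
The interaction between unique and sparse can be modeled with a toy
class. UniqueIndexModel is hypothetical and tracks only a set of
keys; it is a sketch of the rule described above, not of MongoDB's
index implementation.

```python
class UniqueIndexModel:
    """Toy model of a unique index on one field, with an optional sparse flag."""

    def __init__(self, field, sparse=False):
        self.field = field
        self.sparse = sparse
        self.keys = set()

    def insert(self, document):
        """Record the document's key, or raise if it would be a duplicate."""
        if self.field not in document:
            if self.sparse:
                return  # sparse: no index entry for a missing field
            key = None  # non-sparse: a missing field is indexed as null
        else:
            key = document[self.field]
        if key in self.keys:
            raise ValueError("duplicate key: %r" % (key,))
        self.keys.add(key)
```

In this model a second document without the field is rejected by a
non-sparse unique index but accepted by a sparse one.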
Array Indexes
For fields that contain an array, each array value
is stored as a separate index entry. For example,
documents that describe recipes might include a field
for ingredients. If there is an index on the ingredient
field, each ingredient is indexed and queries on
the ingredient field can be optimized by this index.
There is no special syntax required for creating array
indexes – if the field contains an array, it will be
indexed as an array index.
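
To make "each array value is stored as a separate index entry"
concrete, here is a minimal sketch that builds a lookup from array
elements to document ids. The helper build_array_index and the sample
recipes are hypothetical; a dictionary stands in for the index
structure.

```python
from collections import defaultdict


def build_array_index(documents, field):
    """Map each array element to the _ids of the documents containing it."""
    index = defaultdict(set)
    for doc in documents:
        for value in doc.get(field, []):
            index[value].add(doc["_id"])
    return index


recipes = [
    {"_id": 1, "name": "pancakes", "ingredients": ["flour", "egg", "milk"]},
    {"_id": 2, "name": "omelette", "ingredients": ["egg", "butter"]},
]
```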
It is also possible to specify a compound array
index. If the recipes also contained a field for the
number of calories, a compound index on calories
and ingredients could be created, and queries that
specified a value for calories and ingredients would
be optimized with this index. For compound array
indexes only one of the fields can be an array in each
document.
TTL Indexes
In some cases data should expire out of the system
automatically. Time to Live (TTL) indexes allow the
user to specify a period of time after which the data
will automatically be deleted from the database.
A common use of TTL indexes is applications that
maintain a rolling window of history (e.g., most recent
100 days) for user actions such as click streams.
Geospatial Indexes
MongoDB provides geospatial indexes to optimize
queries related to location within a two dimensional
space, such as projection systems for the earth. The
index supports data stored as both GeoJSON objects
and as regular 2D coordinate pairs. Documents
must have a field with a two-element array, such
as latitude and longitude to be indexed with a
geospatial index. These indexes allow MongoDB to
optimize queries that request all documents closest
to a specific point in the coordinate system.
Sparse Indexes
Sparse indexes only contain entries for documents
that contain the specified field. Because the
document data model of MongoDB allows for
flexibility in the data model from document to
document, it is common for some fields to be present
only in a subset of all documents. Sparse indexes
allow for smaller, more efficient indexes when fields
are not present in all documents.
By default, the sparse option for indexes is false.
Using a sparse index will sometimes lead to
incomplete results when performing index-based
operations such as filtering and sorting. By default,
MongoDB will create null entries in the index for
documents that are missing the specified field.
Text Search Indexes
MongoDB provides a specialized index for text search
that uses advanced, language-specific linguistic rules
for stemming, tokenization and stop words. Queries
that use the text search index will return documents
in relevance order. One or more fields can be included
in the text index.
Hash Indexes
Hash indexes compute a hash of the value of a
field and index the hashed value. The primary use
of this index is to enable hash-based sharding of
a collection, a simple and uniform distribution of
documents across shards.
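
The effect of hashing on distribution can be sketched as follows.
This is an illustration under loud assumptions: SHA-256 is an
arbitrary choice of hash function for the example, the modulo step is
a stand-in for however hashed values are actually assigned to shards,
and the helper names are hypothetical.

```python
import hashlib
from collections import Counter


def shard_for(value, shard_count):
    """Pick a shard from a hash of the value (modulo is a stand-in)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def distribution(values, shard_count):
    """Count how many values land on each shard."""
    return Counter(shard_for(v, shard_count) for v in values)
```

Even monotonically increasing values, such as sequential ids, are
spread across all shards because adjacent inputs produce unrelated
hash values.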
For more on indexes, see Indexing Overview in the
MongoDB Documentation.
Index Creation Options
Indexes and data are updated synchronously in
MongoDB. The appropriate indexes should be
determined as part of the schema design process
prior to deploying the system.
By default creating an index is a blocking operation
in MongoDB. Because the creation of indexes can
be time and resource intensive, MongoDB provides
an option for creating new indexes as a background
operation. When the background option is enabled,
the total time to create an index will be greater than
if the index was created in the foreground, but it will
still be possible to use the database while creating
indexes. In addition, multiple indexes can be built
concurrently in the background.
Production Application Checks for Indexes
Make sure that the application checks for the
existence of all appropriate indexes on startup
and that it terminates if indexes are missing. Index
creation should be performed by separate application
code and during normal maintenance operations.
Index Maintenance Operations
Background index operations on a replica set primary
become foreground index operations on replica
set secondaries, which will block all replication.
Therefore the best approach to building indexes on
replica sets is to:
1. Restart the secondary replica in
standalone mode.
2. Build the indexes.
3. Restart as a member of the replica set.
4. Allow the secondary to catch up to the other
members of the replica set.
5. Proceed to step one with the next secondary.
6. When all the indexes have been built on the
secondaries, restart the primary in standalone
mode. One of the secondaries will be elected
as primary so the application can continue
to function.
7. Build the indexes on the original primary, then
restart it as a member of the replica set.
8. Issue a request for the original primary to resume
its role as primary replica.
See the MongoDB Documentation for Build Index on
Replica Sets for a full set of procedures.
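
The ordering of the procedure above can be sketched as a function
that generates a checklist for a given replica set. The function name
and the member names are hypothetical, and the output is descriptive
text only; it does not issue any commands to a server.

```python
def rolling_index_build_plan(primary, secondaries):
    """Return the ordered steps for building an index one member at a time."""
    steps = []
    for member in secondaries:
        steps.append("restart %s in standalone mode" % member)
        steps.append("build indexes on %s" % member)
        steps.append("restart %s as a replica set member" % member)
        steps.append("wait for %s to catch up" % member)
    # The primary goes last; a secondary is elected while it is offline.
    steps.append("restart %s in standalone mode" % primary)
    steps.append("build indexes on %s" % primary)
    steps.append("restart %s as a replica set member" % primary)
    steps.append("ask %s to resume the primary role" % primary)
    return steps
```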
Index Limitations
There are a few limitations to indexes that should be
observed when deploying MongoDB:
• A collection cannot have more than 64 indexes.
• Index entries cannot exceed 1024 bytes.
• The name of an index must not exceed 128
characters (including its namespace).
• The optimizer generally uses one index at a time.
• Indexes consume disk space and memory.
Use them as necessary.
• Indexes can impact update performance –
an update must first locate the data to change,
so an index will help in this regard, but index
maintenance itself has overhead and this work
will slow update performance.
• In-memory sorting of data without an index
is limited to 32MB. This operation is very CPU
intensive, and in-memory sorts indicate an index
should be created to optimize these queries.
Common Mistakes Regarding Indexes
The following tips may help to avoid some common
mistakes regarding indexes:
• Creating multiple indexes in support of a single
query: MongoDB will use a single index to
optimize a query. If you need to specify multiple
predicates, you need a compound index. For
example, if there are two indexes, one on first
name and another on last name, queries that
specify a constraint for both first and last names
will only use one of the indexes, not both. To
optimize these queries, a compound index on last
name and first name should be used.
• Compound indexes: Compound indexes are
defined and ordered by field. So, if a compound
index is defined for last name, first name, and
city, queries that specify last name or last name
and first name will be able to use this index, but
queries that try to search based on city will not
be able to benefit from this index.
• Low selectivity indexes: An index should
radically reduce the set of possible documents to
select from. For example, an index on a field that
indicates male/female is not as beneficial as an
index on zip code, or even better, phone number.
• Regular expressions: Trailing wildcards work
well, but leading wildcards do not because the
indexes are ordered.
• Negation: Inequality queries are inefficient with
respect to indexes.
WORKING SETS
MongoDB makes extensive use of RAM to speed up
database operations. In MongoDB, all data is read and
manipulated through memory-mapped files. Reading
data from memory is measured in nanoseconds and
reading data from disk is measured in milliseconds;
reading from memory is approximately 100,000
times faster than reading data from disk. The set of
data and indexes that are accessed during normal
operations is called the working set.
It should be the goal of the deployment team that the
working set fits in RAM. It may be the case that the working
set represents a fraction of the entire database, such
as in applications where data related to recent events
or popular products is accessed most commonly.
Page faults occur when MongoDB attempts to access
data that has not been loaded in RAM. If there is
free memory then the operating system can locate
the page on disk and load it into memory directly.
However, if there is no free memory the operating
system must write a page that is in memory to disk
and then read the requested page into memory.
This process can be time consuming and will be
significantly slower than accessing data that is
already in memory.
Some operations may inadvertently purge a large
percentage of the working set from memory, which
adversely affects performance. For example, a query
that scans all documents in the database, where
the database is larger than the RAM on the server,
will cause documents to be read into memory and
the working set to be written out to disk. Other
examples include some maintenance operations such
as compacting or repairing a database and rebuilding
indexes.
If your database working set size exceeds the
available RAM of your system, consider increasing the
RAM or adding additional servers to the cluster and
sharding your database. For a discussion on this topic,
see the section on Sharding Best Practices. It is far
easier to implement sharding before the resources of
the system become limited, so capacity planning is an
important element in the successful delivery of the
project.
A useful output included with the serverStatus
command is a workingSet document that provides an
estimated size of the MongoDB instance’s working
set. Operations teams can track the number of pages
accessed by the instance over a given period, and the
elapsed time from the oldest to newest document
in the working set. By tracking these metrics,
it is possible to detect when the working set is
approaching current RAM limits and proactively take
action to ensure the system is scaled.
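As a rough sketch of this kind of tracking, the snippet below converts a page count into an estimated size and compares it with available RAM. The field names pagesInMemory and overSeconds, the 4096-byte page size, and the 80% threshold are all assumptions made for illustration; check them against the serverStatus output of your MongoDB version. The sample values are invented, not real server output.

```python
PAGE_SIZE_BYTES = 4096  # assumed page size; verify on your platform


def working_set_ratio(working_set_doc, ram_bytes, page_size=PAGE_SIZE_BYTES):
    """Return the estimated working set size as a fraction of RAM."""
    estimated_bytes = working_set_doc["pagesInMemory"] * page_size
    return estimated_bytes / ram_bytes


def needs_attention(working_set_doc, ram_bytes, threshold=0.8):
    """True once the estimate reaches the chosen fraction of RAM."""
    return working_set_ratio(working_set_doc, ram_bytes) >= threshold


# Hypothetical sample values, not real server output.
sample = {"pagesInMemory": 1_500_000, "overSeconds": 900}
ram = 8 * 1024**3  # 8 GB of RAM
print(round(working_set_ratio(sample, ram), 2), needs_attention(sample, ram))
```

Recording the ratio at regular intervals gives a trend line, which is more useful for capacity planning than any single reading.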
MONGODB SETUP AND CONFIGURATION
Setup
MongoDB provides repositories for .deb and .rpm
packages for consistent setup, upgrade, system
integration, and configuration. This software uses
the same binaries as the tarball packages provided at
http://www.mongodb.org/downloads.
Database Configuration
Users should store configuration options in mongod’s
configuration file. This allows sysadmins to
implement consistent configurations across entire
clusters. The configuration files support all options
provided as command line options for mongod.
Installations and upgrades should be automated
through popular tools such as Chef and Puppet, and
the MongoDB community provides and maintains
example scripts for these tools.
Upgrades
Users should upgrade software as often as possible so
that they can take advantage of the latest features as
well as any stability updates or bug fixes. Upgrades
should be tested in non-production environments
to ensure production applications are not adversely
affected by new versions of the software.
Customers can deploy rolling upgrades without
incurring any downtime, as each member of a replica
set can be upgraded individually without impacting
cluster availability. It is possible for each member
of a replica set to run under different versions of
MongoDB. As a precaution, the release notes for the
MongoDB release should be consulted to determine
if there is a particular order of upgrade steps that
needs to be followed and whether there are any
incompatibilities between two specific versions.
DATA MIGRATION
Users should assess how best to model their data
for their applications rather than simply importing
the flat file exports of their legacy systems. In a
traditional relational database environment, data
tends to be moved between systems using delimited
flat files such as CSV files. While it is possible to
ingest data into MongoDB from CSV files, this may in
fact only be the first step in a data migration process.
It is typically the case that MongoDB’s document data
model provides advantages and alternatives that do
not exist in a relational data model.
The mongoimport and mongoexport tools are
provided with MongoDB for simple loading or
exporting of data in JSON or CSV format. These tools
may be useful in moving data between systems as
an initial step. Other tools called mongodump and
mongorestore are useful for moving data between
two MongoDB systems.
There are many options to migrate data from flat files
into rich JSON documents, including custom scripts,
ETL tools, and logic within the application itself that
reads from the existing RDBMS and then writes a
JSON version of the document to MongoDB.
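A custom script of this kind might look like the following minimal sketch. It assumes a hypothetical flat export with one row per order and repeated customer columns, and groups the rows into one document per customer with an embedded list of orders. The column names and the rows_to_documents helper are illustrative only.

```python
import csv
import io
import json

# Hypothetical flat export: one row per order, customer data repeated.
FLAT_CSV = """customer_id,name,city,order_id,item
1,Ada,London,100,keyboard
1,Ada,London,101,monitor
2,Lin,Taipei,102,mouse
"""


def rows_to_documents(csv_text):
    """Group flat rows into one document per customer with embedded orders."""
    documents = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        doc = documents.setdefault(
            row["customer_id"],
            {"_id": int(row["customer_id"]), "name": row["name"],
             "city": row["city"], "orders": []},
        )
        doc["orders"].append(
            {"order_id": int(row["order_id"]), "item": row["item"]}
        )
    return list(documents.values())


# Emit one JSON document per line.
for document in rows_to_documents(FLAT_CSV):
    print(json.dumps(document))
```

Embedding the orders inside the customer document is one modeling choice among several; whether it suits a given application depends on how the data is queried and how large the embedded list can grow.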
HARDWARE
The following suggestions are only intended to
provide high-level guidance for hardware for a
MongoDB deployment. The specific configuration
of your hardware will be dependent on your data,
your queries, your performance SLA, your availability
requirements, and the capabilities of the underlying
hardware components. MongoDB has extensive
experience helping customers to select hardware
and tune their configurations and we frequently
work with customers to plan for and optimize their
MongoDB systems.
MongoDB was specifically designed with commodity
hardware in mind and has few hardware requirements
or limitations. Generally speaking, MongoDB will take
advantage of more RAM and faster CPU clock speeds.
Memory
MongoDB makes extensive use of RAM to increase
performance. Ideally, the full working set fits in RAM.
As a general rule of thumb, the more RAM, the better.
As workloads begin to access data that is not in RAM,
the performance of MongoDB will degrade. MongoDB
delegates the management of RAM to the operating
system. MongoDB will use as much RAM as possible
until it exhausts what is available.
Storage
MongoDB does not require shared storage (e.g.,
storage area networks). MongoDB can use local
attached storage as well as solid state drives (SSDs).
Most disk access patterns in MongoDB do not have
sequential properties, and as a result, customers may
experience substantial performance gains by using
SSDs. Good results and strong price-to-performance
have been observed with SATA SSDs and with PCI SSDs.
Commodity SATA spinning drives are comparable to
higher cost spinning drives due to the non-sequential
access patterns of MongoDB: rather than spending
more on expensive spinning drives, that money may
be more effectively spent on more RAM or SSDs.
Another benefit of using SSDs is that they provide
a more gradual degradation of performance if the
working set no longer fits in memory.
While data files benefit from SSDs, MongoDB’s journal
files are good candidates for fast, conventional disks
due to their high sequential write profile. See the
section on journaling later in this guide for more
information.
Most MongoDB deployments should use RAID-10.
RAID-5 and RAID-6 do not provide sufficient
performance. RAID-0 provides good write
performance, but limited read performance and
insufficient fault tolerance. MongoDB’s replica sets
allow deployments to provide stronger availability for
data, and should be considered with RAID and other
factors to meet the desired availability SLA.
CPU
MongoDB performance is typically not CPU-bound.
As MongoDB rarely encounters workloads able to
leverage large numbers of cores, it is preferable to
have servers with faster clock speeds than servers with
numerous cores and slower clock speeds.
Server Capacity vs. Server Quantity
MongoDB was designed with horizontal scale-out
in mind using cost-effective, commodity hardware.
Even within the commodity server market there are
options regarding the number of processors, amount
of RAM, and other components. Customers frequently
ask whether it is better to have a smaller number
of larger capacity servers or a larger number of
smaller capacity servers. In a MongoDB deployment
it is important to ensure there is sufficient RAM to
keep the database working set in memory. While
it is not required that the working set fit in RAM,
the performance of the database will degrade if a
significant percentage of reads and writes are applied
to data and indexes that are not in RAM.
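The trade-off can be made concrete with a back-of-the-envelope calculation. The sketch below estimates how many servers are needed for a working set to fit in aggregate RAM. The usable_fraction parameter (RAM left after the operating system and other overhead) is an assumption, and the calculation ignores replica set members and uneven data distribution.

```python
import math


def servers_for_working_set(working_set_gb, ram_per_server_gb,
                            usable_fraction=0.8):
    """Estimate the number of servers whose combined usable RAM holds
    the working set. usable_fraction is an assumed planning margin."""
    usable_gb = ram_per_server_gb * usable_fraction
    return math.ceil(working_set_gb / usable_gb)


# Hypothetical 200 GB working set on two server sizes.
for ram_gb in (64, 256):
    print(ram_gb, "GB servers:", servers_for_working_set(200, ram_gb))
```

Comparing the counts for several server sizes against their prices is one simple way to frame the capacity versus quantity question.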
Process Per Host
Users should run one mongod process per host. A
mongod process is designed to run as the single
server process on a system; doing so enables it to
store its working set in memory most efficiently.
Running multiple processes on a single host reduces
redundancy and risks operational degradation, as
multiple instances compete for the same resources.
The exception is for mongod processes that are acting
in the role of arbiter – these may co-exist with other
processes or be deployed on smaller hardware.
Virtualization and IaaS
Customers can deploy MongoDB on bare metal
servers, in virtualized environments and in the
cloud. Performance will typically be best and most
consistent using bare metal, though numerous
MongoDB users leverage infrastructure-as-a-service
(IaaS) products like Amazon Web Services’ Elastic
Compute Cloud (AWS EC2), Rackspace, etc.
IaaS deployments are especially good for initial
testing and development, as they provide a low-risk,
low-cost means for getting the database up and
running. MongoDB has partnerships with a number of
cloud and managed services providers, such as AWS,
IBM with Softlayer and Microsoft Windows Azure,
in addition to partners that provide fully managed
instances of MongoDB, like MongoLab, MongoHQ and
Rackspace with ObjectRocket.
Sizing for Mongos and Config Server Processes
For sharded systems, additional processes must be
deployed with the mongod data storing processes:
mongos and config servers. Shards are physical
partitions of data spread across multiple servers.
For more on sharding, please see the section on
horizontal scaling with shards. Queries are routed to
the appropriate shards using a query router process
called mongos. The metadata used by mongos to
determine where to route a query is maintained by
the config servers. Both mongos and config server
processes are lightweight, but each has somewhat
different requirements regarding sizing.
Within a shard, MongoDB further partitions
documents into chunks. MongoDB maintains
metadata about the relationship of chunks to
shards in the config server. Three config servers
are maintained in sharded deployments to ensure
availability of the metadata at all times. To estimate
the total size of the shard metadata, multiply the
size of the chunk metadata by the total number
of chunks in your database – the default chunk size
is 64MB. For example, a 64TB database would have
1 million chunks and the total size of the shard
metadata managed by the config servers would be 1
million times the size of the chunk metadata, which
could range from hundreds of MB to several GB of
metadata. Shard metadata access is infrequent:
each mongos maintains a cache of this data, which
is periodically updated by background processes
when chunks are split or migrated to other shards.
The hardware for a config server should therefore be
focused on availability: redundant power supplies,
redundant network interfaces, redundant RAID
controllers, and redundant storage should be used.
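The metadata estimate described above can be worked through in a few lines. The per-chunk metadata sizes used here (512 bytes and 4 KB) are assumptions chosen only to reproduce the "hundreds of MB to several GB" range; the actual size depends on the shard key and the deployment.

```python
MB = 1024**2
GB = 1024**3
TB = 1024**4


def chunk_count(db_bytes, chunk_bytes=64 * MB):
    """Number of chunks, rounding up (default chunk size 64MB)."""
    return -(-db_bytes // chunk_bytes)  # ceiling division


def shard_metadata_bytes(db_bytes, per_chunk_metadata_bytes,
                         chunk_bytes=64 * MB):
    """Total shard metadata: chunk count times metadata per chunk."""
    return chunk_count(db_bytes, chunk_bytes) * per_chunk_metadata_bytes


db_size = 64 * TB
print("chunks:", chunk_count(db_size))
# Assumed per-chunk metadata sizes, for illustration only.
for per_chunk in (512, 4096):
    total = shard_metadata_bytes(db_size, per_chunk)
    print(per_chunk, "bytes per chunk ->", total / GB, "GB of metadata")
```

With binary units, a 64TB database at the default chunk size yields 1,048,576 chunks, which the text rounds to 1 million.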
Typically multiple mongos instances are used in a
sharded MongoDB system. It is not uncommon for
MongoDB users to deploy a mongos instance on each
of their application servers. The optimal number of
mongos servers will be determined by the specific
workload of the application: in some cases mongos
simply routes queries to the appropriate shards, and
in other cases mongos performs aggregation and
other tasks. To estimate the memory requirements for
each mongos, consider the following:
• The total size of the shard metadata that is
cached by mongos
• 1MB for each connection to applications and to
each mongod
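These two factors can be combined into a simple estimate, sketched below. The metadata size and connection count in the example are hypothetical inputs; only the 1MB per connection figure comes from the list above.

```python
MB = 1024**2


def mongos_memory_bytes(cached_metadata_bytes, connections,
                        per_connection_bytes=1 * MB):
    """Estimate mongos memory: cached shard metadata plus 1MB per connection."""
    return cached_metadata_bytes + connections * per_connection_bytes


# Hypothetical inputs: 512MB of cached metadata and 2,000 connections.
estimate = mongos_memory_bytes(512 * MB, 2000)
print(estimate // MB, "MB")
```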
While mongod instances are typically limited by disk
performance and available RAM more than they are
limited by CPU speed, mongos uses limited RAM and
will benefit from fast CPUs and networks.
OPERATING SYSTEM AND FILE SYSTEM
CONFIGURATIONS FOR LINUX
Only 64-bit versions of operating systems should be
used for MongoDB. Version 2.6.36 of the Linux kernel
or later should be used for MongoDB in production.
Because MongoDB pre-allocates its database files
before using them and because MongoDB uses very
large files on average, Ext4 and XFS file systems are
recommended:
• If you use the Ext4 file system, use at least
version 2.6.23 of the Linux Kernel.
• If you use the XFS file system, use at least
version 2.6.25 of the Linux Kernel.
For MongoDB on Linux use the following
recommended configurations:
• Turn off atime for the storage volume with the
database files.
• Do not use hugepages virtual memory pages;
MongoDB performs better with normal virtual
memory pages.
• Disable NUMA in your BIOS or invoke mongod
with NUMA disabled.
• Ensure that readahead settings for the block
devices that store the database files are
relatively small as most access is non-sequential.
For example, setting readahead to 32 (16KB) is a
good starting point.
• Synchronize time between your hosts. This is
especially important in MongoDB clusters.
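The readahead figure in the list above is expressed in sectors. Assuming the common convention of 512-byte sectors, the conversion to kilobytes is simple arithmetic, sketched here with two hypothetical helper functions:

```python
SECTOR_BYTES = 512  # assumed sector size used for readahead values


def readahead_kb(sectors, sector_bytes=SECTOR_BYTES):
    """Convert a readahead value in sectors to kilobytes."""
    return sectors * sector_bytes / 1024


def sectors_for_kb(kilobytes, sector_bytes=SECTOR_BYTES):
    """Convert a desired readahead size in kilobytes to sectors."""
    return kilobytes * 1024 // sector_bytes


print(readahead_kb(32), "KB for a readahead of 32 sectors")
print(sectors_for_kb(16), "sectors for a 16KB readahead")
```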
Linux provides controls to limit the number of
resources and open files on a per-process and per-user
basis. The default settings may be insufficient
for MongoDB. Generally MongoDB should be the only
process on a system to ensure there is no contention
with other processes.
While each deployment has unique requirements, the
following settings are a good starting point for mongod
and mongos instances. Use ulimit to apply these
settings:
• -f (file size): unlimited
• -t (cpu time): unlimited
• -v (virtual memory): unlimited
• -n (open files): 64000
• -m (memory size): unlimited
• -u (processes/threads): 32000
For more on using ulimit to set the resource limits for
MongoDB, see the MongoDB Documentation page on
Linux ulimit Settings.
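One way to audit these limits is sketched below, using Python's standard resource module to compare soft limits against the starting-point values listed above. Note that it inspects the limits of the Python process that runs it, not those of a running mongod, so it is only meaningful when started under the same user and shell configuration. The mapping from ulimit flags to resource constants and the helper names are assumptions of this sketch.

```python
import resource

UNLIMITED = resource.RLIM_INFINITY

# Starting-point values from the list above, keyed by resource constant name.
RECOMMENDED = {
    "RLIMIT_FSIZE": UNLIMITED,   # -f file size
    "RLIMIT_CPU": UNLIMITED,     # -t cpu time
    "RLIMIT_AS": UNLIMITED,      # -v virtual memory
    "RLIMIT_NOFILE": 64000,      # -n open files
    "RLIMIT_RSS": UNLIMITED,     # -m memory size
    "RLIMIT_NPROC": 32000,       # -u processes/threads
}


def satisfies(soft, wanted):
    """True if a soft limit is at least as permissive as the wanted value."""
    if soft == UNLIMITED:
        return True
    if wanted == UNLIMITED:
        return False
    return soft >= wanted


def shortfalls(soft_limits):
    """Return the names in RECOMMENDED whose soft limit is too low."""
    return [name for name, wanted in RECOMMENDED.items()
            if name in soft_limits
            and not satisfies(soft_limits[name], wanted)]


def read_soft_limits():
    """Read this process's soft limits, skipping names the platform lacks."""
    limits = {}
    for name in RECOMMENDED:
        constant = getattr(resource, name, None)
        if constant is not None:
            limits[name] = resource.getrlimit(constant)[0]
    return limits


print("Below recommended:", shortfalls(read_soft_limits()))
```

An empty list means every checked soft limit meets or exceeds the starting-point value.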
NETWORKING
Always run MongoDB in a trusted environment with
network rules that prevent access from all unknown
entities. There are a finite number of pre-defined
processes that communicate with a MongoDB system:
application servers, monitoring processes, and
MongoDB processes.
By default MongoDB processes will bind to all
available network interfaces on a system. If your
system has more than one network interface, bind
MongoDB processes to the private or internal
network interface.
Detailed information on default port numbers for
MongoDB, configuring firewalls for MongoDB,
VPN, and other topics is available on the MongoDB
Documentation page for Security Practices and
Management.
PRODUCTION-PROVEN RECOMMENDATIONS
The latest suggestions on specific configurations for
operating systems, file systems, storage devices and
other system-related topics are maintained on the
MongoDB Documentation Production Notes page.
II. High Availability
Under normal operating conditions, a MongoDB
cluster will perform according to the performance
and functional goals of the system. However, from
time to time certain inevitable failures or unintended
actions can affect a system in adverse ways. Hard
drives, network cards, power supplies, and other
hardware components will fail. These risks can be
mitigated with redundant hardware components.
Similarly, a MongoDB system provides configurable
redundancy throughout its software components as
well as configurable data redundancy.
JOURNALING
MongoDB implements write-ahead journaling to
enable fast crash recovery and consistency in the
storage engine. Journaling is enabled by default
for 64-bit platforms. Users should never disable
journaling; journaling helps prevent corruption and
increases operational resilience. Journal commits are
issued at least as often as every 100ms by default.
In the case of a server crash, journal entries will be
recovered automatically. Therefore the time between
journal commits represents the maximum possible
data loss. This setting can be configured to a value
that is appropriate for the application.
It may be beneficial for performance to locate
MongoDB’s journal files and data files on separate
storage arrays. The I/O patterns for the journal are
very sequential in nature and are well suited for
storage devices that are optimized for fast sequential
writes, whereas the data files are well suited for
storage devices that are optimized for random
reads and writes. Simply placing the journal files
on a separate storage device normally provides
some performance enhancements by reducing disk
contention.
DATA REDUNDANCY
MongoDB maintains multiple copies of data, called
replica sets, using native replication. Users should