---
layout: course
title: Logging (Week 12, Monday, March 31)
sched-activation: class="active"
---
From the Python 3 logging tutorial (the basicConfig call selects the log file and the timestamp format shown in the output below):

```python
import logging
logging.basicConfig(filename='example.log', level=logging.DEBUG,
                    format='%(asctime)s %(levelname)s:%(name)s:%(message)s')
logging.debug('This message should go to the log file')
logging.info('So should this')
logging.warning('And this, too')
```
```
2010-12-12 11:41:42,612 DEBUG:root:This message should go to the log file
2010-12-12 11:41:43,015 INFO:root:So should this
2010-12-12 11:42:35,756 WARNING:root:And this, too
```
Questions to ask about your own logging:
- When do you record it?
- What information do you record?
- Do you have levels (DEBUG, INFO, WARNING)?
- When do you turn levels on and off?
- How do you analyze the logs?
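As a sketch of how the level questions play out in Python's standard logging module (the module names below are invented for illustration), the threshold can be set globally and overridden per logger:

```python
import logging

# Global default: only WARNING and above are emitted
logging.basicConfig(level=logging.WARNING)

# Turn DEBUG on for one suspect module only ('myapp.resize' is a made-up name)
logging.getLogger('myapp.resize').setLevel(logging.DEBUG)

logging.getLogger('myapp.resize').debug('emitted: this logger is at DEBUG')
logging.getLogger('myapp.upload').debug('suppressed: still at the WARNING default')
```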
Task you want to perform | The best tool |
---|---|
Display output for ordinary use of a command-line program | print() |
Report events from normal operation | logging.info() or logging.debug() |
Issue a warning regarding an event | logging.warning() if there is nothing the application can do |
Report an error from a specific event | Raise an exception |
Report suppression of an error in a long-running process | logging.error(), logging.exception(), or logging.critical(), as appropriate
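A small sketch of those choices together in Python; the resize() function, file names, and messages are invented for illustration:

```python
import logging

logging.basicConfig(level=logging.INFO)

def resize(path):
    logging.info('resize started for %s', path)              # event from normal operation
    if path.endswith('.gif'):
        logging.warning('animation frames will be dropped')  # nothing the app can do
    if not path:
        raise ValueError('no image path given')               # error for a specific event
    print('Resized ' + path)                                  # ordinary command-line output

def worker_loop(paths):
    for p in paths:
        try:
            resize(p)
        except ValueError:
            # long-running process: suppress the error but record it with a traceback
            logging.exception('skipping bad request %r', p)

worker_loop(['cat.jpg', '', 'party.gif'])
```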
From {{site.data.bibliography.hull2013.title}}, p. 58:
Number 7: Insufficient monitoring and metrics
- "it should be so basic you cannot imagine working without it"
Number 10: Insufficient logging
- "You may enable a lot more of it when you are troubleshooting and debugging, but on an ongoing basis you will need it for key essential services"
From {{site.data.bibliography.hamilton2007.title}}, pp. 231--232:
From {{site.data.bibliography.hoff2007.title}}.
For highly-available applications:
- Log everything
- Log all the time
- Only have two levels, NORMAL and DEBUG
- Turn on debugging per-module
- Every event should include the id of the customer request that started it:
  - Customer S. Lee requested a resize of image 'my-vacation-july-24-444.jpg' => Request QX3567187
All log entries for that request then carry the id:

```
2014-02-12 11:41:42,612 root:QX3567187:Resize from S. Lee of 'my-vacation-july-24-444.jpg' started
2014-02-12 11:41:43,015 root:QX3567187:Resize saved in S3 entry 'lee-mvj24-3617846.jpg'
2014-02-12 11:41:43,212 root:QX3567187:Resize sent to instance EC2-Q347HN for 100 by 100 resize
2014-02-12 11:42:35,756 root:QX3567187:Resize completed by EC2-Q347HN
```
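One way to attach the request id to every record in Python is a LoggerAdapter; this is only a sketch, with the format string chosen to match the excerpt above and the handle_resize() flow invented:

```python
import logging

# Every record formatted this way must carry a request_id attribute;
# real code would add a Filter that supplies a default for records that lack one
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(name)s:%(request_id)s:%(message)s')

def handle_resize(request_id, customer, image):
    # Bind the id once; every record made through `log` then carries it
    log = logging.LoggerAdapter(logging.getLogger(), {'request_id': request_id})
    log.info("Resize from %s of '%s' started", customer, image)
    log.info('Resize completed')

handle_resize('QX3567187', 'S. Lee', 'my-vacation-july-24-444.jpg')
```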
Set up fast queue between high-priority worker process and low-priority logging process
- Logging process does slower formatting operations
- Allocating/deallocating queue buffers must be fast
- Logger pushes to permanent storage in the background
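Python's standard library provides this split as QueueHandler plus QueueListener. A minimal in-process sketch follows; the slide's version would run the listener in a separate low-priority process, and 'app.log' is just a placeholder path:

```python
import logging
import logging.handlers
import queue

log_queue = queue.Queue(-1)  # unbounded; enqueueing a record is cheap for the worker

# The worker's handler only puts the record on the queue and returns immediately
logging.getLogger().addHandler(logging.handlers.QueueHandler(log_queue))
logging.getLogger().setLevel(logging.INFO)

# A background listener thread does the slower formatting and file I/O
file_handler = logging.FileHandler('app.log')
file_handler.setFormatter(logging.Formatter('%(asctime)s %(name)s:%(message)s'))
listener = logging.handlers.QueueListener(log_queue, file_handler)
listener.start()

logging.info('handled a request')  # returns quickly; writing happens on the listener thread
listener.stop()                    # drain and flush on shutdown
```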
Any object should be easily dumped to the log, ideally with something as simple as a hypothetical logging.dump(myobj)
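Python's logging module has no dump() call, so a rough equivalent is to log the object's repr or a pretty-printed form; the Request class below is made up for illustration:

```python
import logging
import pprint

logging.basicConfig(level=logging.DEBUG)

class Request:                                    # hypothetical object worth dumping
    def __init__(self, rid, image):
        self.rid, self.image = rid, image

req = Request('QX3567187', 'my-vacation-july-24-444.jpg')

# %r defers the repr() work until the record is actually emitted
logging.debug('request state: %r', vars(req))

# pprint.pformat gives a readable multi-line dump of nested structures
logging.debug('request state:\n%s', pprint.pformat(vars(req)))
```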
Products such as {{site.data.bibliography.loggly-nd.title}} integrate logs from multiple sources and analyze them.
Read the following two short sections from {{site.data.bibliography.shute2013.title}}:
- Section 1: Introduction (pp. 1068--1069, not including "2. Basic Architecture").
- Section 10: Latency and Throughput (p. 1078, not including "11. Related Work").
Key points: Most of the paper is concerned with database topics that are outside the scope of this course. However, the two sections I selected relate to two themes of the course:
- The relation between scalability, availability, and latency. The F1 team claims to have found a unique design point in that space.
- Latency, replication, and distribution across data centres. The Paxos algorithm they mention is a quorum algorithm, which needs a majority of its instances to be available in order to make progress.