
Stream2segment

Stream2segment is a program to download, process and visualize event-based seismic waveform data (segments). It is particularly suited to, and optimized for, managing huge amounts of relatively short segments.

Stream2segment is based on template files. Hereafter, we will refer to:

  • download config (file name convention: download.yaml): the file defining the parameters of the download subroutine, including the database URL where the data will be saved (a short illustrative check of such URLs is sketched right after this list). Currently supported are PostgreSQL (installable locally or remotely) and SQLite, which saves all data in a single portable local file
  • processing module (file name convention: processing.py): the Python file where the user implements the code of the processing subroutine for a single segment and, optionally, the functions displaying user-defined plots in the GUI
  • processing config (file name convention: processing.yaml): the (optional) configuration file of the processing module, where the user defines all parameters to be passed to it (including which segments to process)
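
The database URL mentioned above follows the SQLAlchemy URL syntax (see the -d option help further below). As a purely illustrative check, not part of stream2segment, the URL can be verified with a few lines of Python before launching a long download (the URLs below are the examples used in the option help):

# illustrative only: check that a database URL in SQLAlchemy syntax is well formed and reachable
from sqlalchemy import create_engine

sqlite_url = "sqlite:////home/myfolder/db.sqlite"                      # single portable local file
postgres_url = "postgresql://smith:Hw_6,@mymachine.example.org/mydb"   # database must exist beforehand

engine = create_engine(sqlite_url)
with engine.connect():
    pass  # if we get here, the URL can be used as 'dburl'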

All configuration files are in YAML syntax, a human-readable data serialization language. Both the download and processing subroutines feature a logging system printing relevant information to file or database, and can be gracefully stopped at any moment with CTRL+C (in the download phase, a subsequent download with the same parameter set will start from the beginning, but segments downloaded before the interruption will not be downloaded again). Finally, both subroutines display a progress bar with the estimated remaining time (more reliable during processing than during download, but nevertheless very useful for a rough estimate in the latter case, too). In potentially days-long executions, these features are particularly useful.

Stream2segment is a command line application invokable by opening a terminal and typing s2s or stream2segment followed by a command representing a specific task (s2s download, s2s process and so on). The available commands are described below.

Before running any command below, if you installed the program in a Python virtual environment (hopefully you did), remember to activate the virtual environment first. For instance, if the virtual environment is installed inside the package folder, move (cd on a terminal) to the stream2segment folder and type:

source env/bin/activate

When you're finished, type deactivate in the terminal to deactivate the current Python virtual environment and return to the globally installed (system) Python.

Initializing

Command details
s2s init [OPTIONS] OUTDIR

Creates template files for launching download, processing and visualization. OUTDIR will be created if
it does not exist

Options:
--help  Show this message and exit.

The initial step is to create the template files in a directory of your choice. The specific command (see details above) will generate one download config and two processing modules (with their processing configs) covering two common processing cases (in future versions, we plan to increase the number of processing templates in order to improve the use-case coverage). Any implementation detail not mentioned in this section is provided in the template files.

Downloading

Command details
s2s download [OPTIONS]

  Downloads waveform data segments with metadata in a specified database. The -c option (required) sets
  the defaults for all other options below, **which are optional**

Options:
  -c, --config FILE               The path to the configuration file in yaml format
                                  (https://learn.getgrav.org/advanced/yaml).  [required]
  -d, --dburl TEXT                Database url where to save data (currently supported are sqlite and
                                  postgresql. If postgres, the database must have been created beforehand).
                                  If sqlite, just write the path to your local file prefixed with
                                  'sqlite:///' (e.g., 'sqlite:////home/myfolder/db.sqlite'): non-absolute
                                  paths will be relative to the config file they are written in. If non-
                                  sqlite, the syntax is:
                                  dialect+driver://username:password@host:port/database E.g.:
                                  'postgresql://smith:Hw_6,@mymachine.example.org/mydb' (for info see:
                                  http://docs.sqlalchemy.org/en/latest/core/engines.html#database-urls)
  -es, --eventws TEXT             The event web service url to use. Supply a *full* url (up to and not
                                  including the first query character '?') or a path to a local file. The
                                  events list returned by the url or in the supplied file must be formatted
                                  as specified in https://www.fdsn.org/webservices/FDSN-WS-
                                  Specifications-1.1.pdf#page=16 or as isf
                                  (http://www.isc.ac.uk/standards/isf/download/isf.pdf), although the latter
                                  has limited support in this program (e.g., comments are not allowed. Use
                                  at your own risk). You can also type one of the following shortcut
                                  strings: {{ DOWNLOAD_EVENTWS_LIST }}
  -s, --start, --starttime DATE or DATETIME
                                  Limit to events (and datacenters) on or after the specified start time.
                                  Specify a date or date-time in iso-format or an integer >=0 to denote the
                                  number of days before today at midnight. Example: start=1 and end=0 =>
                                  fetch events occurred yesterday.
  -e, --end, --endtime DATE or DATETIME
                                  Limit to events (and datacenters) on or before the specified end time.
                                  Specify a date or date-time in iso-format or an integer >=0 to denote the
                                  number of days before today at midnight. Example: start=1 and end=0 =>
                                  fetch events occurred yesterday.
  -n, --network, --networks, --net TEXT
                                  Limit the search to the specified networks (see 'channel' parameter for
                                  details).
  -z, --station, --stations, --sta TEXT
                                  Limit the search to the specified stations (see 'channel' parameter for
                                  details).
  -l, --location, --locations, --loc TEXT
                                  Limit the search to the specified locations (see 'channel' parameter for
                                  details).
  -k, --channel, --channels, --cha TEXT
                                  Limit the search to the specified channels (if missing, defaults to '*',
                                  i.e.: accept all channels) Wildcards '?' and '*' are recognized
                                  (https://www.fdsn.org/webservices/FDSN-WS-Specifications-1.1.pdf), as well
                                  as the operator '!' placed as first character to indicate logical NOT.
                                  Example: "!B*,BBB" accepts all channels NOT starting with "B" OR the
                                  channel "BBB"
  -msr, --min-sample-rate FLOAT   Limit the search to channels with at least the following sample rate (in
                                   Hz). The relative segments will *most likely* (but not always) match the
                                  channel sample rate. Set to 0 or negative number to ignore the sampling
                                  rate
  -ds, --dataws TEXT              data-select web service to use (url). It *must* be FDSN compliant:
                                  <site>/fdsnws/dataselect/<majorversion>/query otherwise the station query
                                  can not be retrieved automatically (the site scheme is optional and will
                                  default to 'http://' in case. An ending '/' or '?' will be removed from
                                  the url, if present). You can also type two special values: "iris"
                                  (shortcut for: https://service.iris.edu/fdsnws/dataselect/1/query) or
                                  "eida" (which will automatically fetch data from the urls of all EIDA
                                  datacenters).
  -t, --traveltimes-model TEXT    The model to be used to assess the travel times of a wave from the event
                                  location to each station location. Type a string denoting a file name
                                  (absolute path) of a custom model created by means of `s2s utils ttcreate`
                                  or one of the 4 built-in models (all assuming receiver depth=0 for
                                  simplicity): ak135_ttp+: ak135 model pre-computed for all ttp+ phases (P
                                  wave arrivals) ak135_tts+: ak135 model pre-computed for all tts+ phases (S
                                  wave arrivals) iasp91_ttp+: iasp91 model pre-computed for all ttp+ phases
                                  (P wave arrivals) iasp91_tts+: iasp91 model pre-computed for all tts+
                                  phases (S wave arrivals) For each segment, the arrival time (travel time +
                                  event time) will be the pivot whereby the user sets up the download time
                                  window (see also 'timespan').
  -w, --timespan FLOAT...         The segment's time span (i.e., the data time window to download): specify
                                  two positive floats denoting the minutes to account for before and after
                                  the calculated arrival time. Note that 3.5 means 3 minutes 30 seconds, and
                                  that each segment window will be eventually rounded to the nearest second
                                  to avoid floating point errors when checking for segments to re-download
                                  because of a changed window.
  -u, --update-metadata           Update segments metadata, i.e. overwrite the data of already saved
                                  stations and channels. Metadata include the station inventories (see
                                  'inventory' for details). This parameter does not affect new stations and
                                  channels, which will be saved on the db anyway
  -r1, --retry-url-err            Try to download again already saved segments with no waveform data
                                  because of a general url error (e.g., no internet connection, timeout,
                                  ...)
  -r2, --retry-mseed-err          Try to download again already saved segments with no waveform data
                                  because the response was malformed, i.e. not readable as MiniSeed
  -r3, --retry-seg-not-found      Try to download again already saved segments with no waveform data
                                  because not found in the response. This is NOT the case when the server
                                  returns no data with an appropriate 'No Content' message, but when a
                                  successful response (usually '200: OK') does not contain the expected
                                  segment data. E.g., a multi-segment request returns some but not all
                                  requested segments.
  -r4, --retry-client-err         Try to download again already saved segments with no waveform data
                                  because of a client error (response code in [400,499])
  -r5, --retry-server-err         Try to download again already saved segments with no waveform data
                                  because of a server error (response code in [500,599])
  -r6, --retry-timespan-err       Try to download again already saved segments with no waveform data because
                                  the response data was completely outside the requested time span (see
                                  'timespan' for details)
  -i, --inventory [true|false|only]
                                  Download station inventories (xml format). Inventories will be downloaded
                                  and saved on the db for all stations that have saved segments with data.
                                  If the metadata should not be updated (see 'update_metadata') already
                                  saved inventories will not be downloaded again. You can always download
                                   inventories later by providing "only" as value (without quotes): this will
                                  skip all other download steps (and ignore all other parameters values
                                  except 'update_metadata')
  -minlat, --minlatitude FLOAT    (eventws query argument) Limit to events with a latitude larger than or
                                  equal to the specified minimum
  -maxlat, --maxlatitude FLOAT    (eventws query argument) Limit to events with a latitude smaller than or
                                  equal to the specified maximum
  -minlon, --minlongitude FLOAT   (eventws query argument) Limit to events with a longitude larger than or
                                  equal to the specified minimum
  -maxlon, --maxlongitude FLOAT   (eventws query argument) Limit to events with a longitude smaller than or
                                  equal to the specified maximum
  --mindepth FLOAT                (eventws query argument) Limit to events with depth more than the
                                  specified minimum
  --maxdepth FLOAT                (eventws query argument) Limit to events with depth less than the
                                  specified maximum
  -minmag, --minmagnitude FLOAT   (eventws query argument) Limit to events with a magnitude larger than the
                                  specified minimum
  -maxmag, --maxmagnitude FLOAT   (eventws query argument) Limit to events with a magnitude smaller than the
                                  specified maximum
  --help                          Show this message and exit.

The download routine can be started by editing the parameters of the download config and running the relative command. The routine fetches the requested events and then searches for the available stations and channels (in case of network error, the data is fetched from the database, if any): the corresponding parameter accepts FDSN-compliant URLs or the special words “eida” or “iris”. In the former case, since EIDA is a federated organization of data centers, a so-called Routing Service is used, as in most download tools, to fetch the URLs of all available data centers and purge potentially duplicated stations returned by more than one data center. The search of channels can be tuned with a parameter controlling the minimum sampling rate and with constraint parameters (network, station, location and channel), which additionally accept the leading character “!” representing exclusion (e.g. “!A*”). Note that a station is internally uniquely identified by the tuple (network code, station code, start time), meaning that the same physical station closed and reopened later is saved as two different station entities in our database. This also allows handling their inventories (which might differ) separately. Events, stations and channels are saved to the database: already existing events are never overridden, whereas for stations and channels this behaviour is configurable (by default they are not overridden).
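
As a purely illustrative sketch of the exclusion semantics described above (this is not stream2segment's actual implementation), a comma-separated list of patterns accepts a value if it matches any non-negated pattern or if it fails to match any pattern prefixed with “!”:

# illustrative sketch of the "!"-exclusion semantics (not the actual implementation)
from fnmatch import fnmatch

def accepts(value, patterns):
    for pattern in patterns.split(","):
        if pattern.startswith("!"):
            if not fnmatch(value, pattern[1:]):
                return True
        elif fnmatch(value, pattern):
            return True
    return False

assert accepts("HHZ", "!B*,BBB")      # does not start with "B"
assert accepts("BBB", "!B*,BBB")      # explicitly listed
assert not accepts("BHZ", "!B*,BBB")  # starts with "B" and is not "BBB"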

Given the lists of events and stations, for each event epicentre the program iteratively finds the nearby stations within a circular area whose configurable radius can be constant or magnitude-dependent: this results in a list of potential segments to be downloaded. For each of these segments, the time at which the event's waves reach the station (arrival time) is efficiently calculated by interpolation on a regular grid of distances and source depths holding travel times pre-computed by means of ObsPy functions. The grid can be created with the command s2s utils ttcreate by setting model name, phases and a maximum error tolerance, or can be configured by supplying the name of one of the four pre-computed grids. With the configurable, arrival-time-dependent time window, all information is now available to download the waveform data segments from the relative data center URLs. Given the amount of data to fetch compared to station and event requests, this is the most demanding step. Therefore, stream2segment packs together all segments belonging to the same time window and data center and queries them at once, reducing the number of connections, and runs each request in a parallel thread to further optimize blocking I/O operations.
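
The idea behind the travel time grid can be sketched in a few lines (this is only an illustration of the interpolation approach, not the actual ttcreate implementation; the grid sizes and the interpolation routine below are arbitrary choices):

# illustrative sketch: pre-compute P travel times on a (depth, distance) grid with ObsPy,
# then interpolate at runtime instead of calling ObsPy for every segment
import numpy as np
from scipy.interpolate import RegularGridInterpolator
from obspy.taup import TauPyModel

depths_km = np.linspace(0, 700, 36)        # source depths (receiver depth assumed 0)
distances_deg = np.linspace(0, 180, 91)    # epicentral distances
model = TauPyModel("ak135")

def first_p_arrival(depth, distance):
    arrivals = model.get_travel_times(source_depth_in_km=depth,
                                      distance_in_degree=distance,
                                      phase_list=["ttp+"])
    return min(arr.time for arr in arrivals) if arrivals else np.nan

grid = np.array([[first_p_arrival(d, x) for x in distances_deg] for d in depths_km])
interpolator = RegularGridInterpolator((depths_km, distances_deg), grid)

# travel time of the first P arrival for a 15 km deep event at 42.7 degrees distance:
travel_time = interpolator([[15.0, 42.7]])[0]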

Even more important in this stage is a detailed tracking of the download results: massive downloads should generally be performed more than once with the same parameter set, as transient network problems, which might be solved in a subsequent download, are likely to happen. For each requested waveform, stream2segment saves a code denoting the download state, either issued by the server (data center) or obtained by inspecting each successfully received waveform by means of an efficient diagnostic module originally implemented for SeisComP3. This results in a broad and configurable spectrum of cases whereby it is possible to retry the download of already saved segments. The diagnostic also returns several pieces of information held in the waveform header, which an RDBMS can efficiently store in order to perform powerful segment selections in the processing phase: maximum gap (in number of samples), data SEED id, sample rate (which might differ from the one stored on the associated station's channel), start time and end time. The last two are particularly important, as the time window of the received segment data might not match the requested time window (RTW): all miniSEED records (chunks of data) outside the RTW will be discarded, avoiding the storage of useless data. The optional, configurable last step is the download of the station inventories (XML format), which will be saved (compressed to optimize storage size) only for stations that have at least one downloaded segment with data. Depending on the configuration, already downloaded inventories will be skipped (the default) or overridden.
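
The record filtering just described can be sketched conceptually with ObsPy as follows (an illustration only: the actual implementation works directly on the raw miniSEED records, and the window bounds in the usage example are hypothetical):

# illustrative sketch: keep only the data chunks overlapping the requested time window (RTW)
from io import BytesIO
from obspy import read, UTCDateTime

def keep_in_window(raw_bytes, request_start, request_end):
    stream = read(BytesIO(raw_bytes), format="MSEED")
    stream.traces = [trace for trace in stream
                     if trace.stats.endtime >= request_start
                     and trace.stats.starttime <= request_end]
    return stream

# hypothetical usage:
# stream = keep_in_window(response_bytes,
#                         UTCDateTime("2017-01-01T10:00:00"),
#                         UTCDateTime("2017-01-01T10:05:00"))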

Processing

Command details
s2s process [OPTIONS] [OUTFILE]

  Processes downloaded waveform data segments via a custom python file and a configuration file.

  [OUTFILE] (optional): the path of the .csv file where the output of the user-defined processing function
  F will be written to (one row per processed segment); all logging information, errors or warnings will
  be written to the file [OUTFILE].[now].log (where [now] denotes the current utc date-time in iso
  format). If this argument is missing, then the output of F (if any) will be discarded, and all logging
  messages will be saved to the file [pyfile].[now].log

Options:
  -d, --dburl TEXT or PATH      Database url where to save data (currently supported are sqlite and
                                postgresql. If postgres, the database must have been created beforehand). If
                                sqlite, just write the path to your local file prefixed with 'sqlite:///'
                                (e.g., 'sqlite:////home/myfolder/db.sqlite'): non-absolute paths will be
                                relative to the config file they are written in. If non-sqlite, the syntax
                                is: dialect+driver://username:password@host:port/database E.g.:
                                'postgresql://smith:Hw_6,@mymachine.example.org/mydb' (for info see:
                                http://docs.sqlalchemy.org/en/latest/core/engines.html#database-urls). It
                                can also be the path of a yaml file containing the property 'dburl' (e.g.,
                                the config file used for downloading)  [required]
  -c, --config FILE             The path to the configuration file in yaml format
                                (https://learn.getgrav.org/advanced/yaml).  [required]
  -p, --pyfile FILE             The path to the python file where the user-defined processing function is
                                implemented. The function will be called iteratively on each segment
                                selected in the config file  [required]
  -f, --funcname TEXT           The name of the user-defined processing function in the given python file.
                                Optional: defaults to 'main' when missing
  -a, --append                  Append results to the output file (this flag is ignored if no output file is
                                provided): 'append' means also that the program will first scan the output
                                file to detect already processed segments and skip them. When missing, it
                                defaults to false, meaning that an output file, if provided, will be
                                overridden if it exists
  --no-prompt                   Do not prompt the user when attempting to overwrite an existing output file.
                                This flag is false by default, i.e. the user will be asked for  confirmation
                                before overwriting an existing file. This flag is ignored if no output file
                                is provided, or the 'append' flag is given
  -mp, --multi-process          Use parallel sub-processes to speed up the execution. When missing, it
                                defaults to false
  -np, --num-processes INTEGER  The number of sub-processes. If missing, it is set as the number of CPUs
                                in the system. This option is ignored if --multi-process is not given
  --help                        Show this message and exit.

Once downloaded, data can be processed by invoking the relative command (see the command details above), which needs a user-defined processing config and a processing module, usually created by editing the template files.

The processing config accepts any kind of parameter to configure the execution, including the selection of suitable segments via a parameter named segment_select. The latter maps fields (the columns of each segment database row and of the related tables, e.g. channel, station, data center, event) to values given as simple string expressions (e.g. has_data: 'true', or event.magnitude: '>=5'), which the program converts into the corresponding SQL syntax.

The processing module is a Python module with a single mandatory function:

def main(segment, config)

which defines the code to be executed on each selected segment. If the optional [OUTFILE] argument is provided, the function may return a list or dictionary of values, which will be written as a row of the CSV file. Nevertheless, there are no restrictions on what can be implemented: the idea of the processing module is simply to free the user from all unnecessary burden and to provide the flexibility of an RDBMS without requiring any SQL knowledge. Everything implemented in the processing config is exposed to the user via the config argument in the form of a Python dictionary, whereas the segment argument is a simple Python object representing the currently processed segment. Its attributes return all sorts of segment data (e.g. segment.arrival_time, segment.max_duration_ratio, segment.has_data), including related database entities (e.g., segment.event, segment.station) in the form of other Python objects. Also provided are several methods (e.g., segment.stream(), segment.inventory()) returning objects for working with ObsPy.
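
A minimal sketch of such a function, using only the attributes and methods mentioned above plus standard ObsPy calls (the returned keys and the response-removal step are arbitrary, illustrative choices; the generated templates contain complete, documented examples):

def main(segment, config):
    # work on a copy of the segment's waveform (an ObsPy Stream)
    stream = segment.stream().copy()
    # illustrative processing step: remove the instrument response
    stream.remove_response(inventory=segment.inventory(), output="VEL")
    trace = stream[0]
    # return one row of the output CSV (keys become column names)
    return {
        "event_magnitude": segment.event.magnitude,
        "arrival_time": str(segment.arrival_time),
        "peak_abs_velocity": float(abs(trace.data).max()),
    }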

A documented list of all these properties is provided in all templates generated with s2s init. Stream2segment also implements several math utilities and functions, which can be listed and printed on screen via the command s2s utils mathinfo. All of them complement the rich set of utilities of the ObsPy library (installed with the package). The user can also run the whole process subroutine across several Python sub-processes to fully leverage multiple processors on a given machine.

Visualizing

Command details
s2s show [OPTIONS]

  Shows raw and processed downloaded waveform's plots in a browser

Options:
  -d, --dburl TEXT or PATH  Database url where to save data (currently supported are sqlite and postgresql.
                            If postgres, the database must have been created beforehand). If sqlite, just
                            write the path to your local file prefixed with 'sqlite:///' (e.g.,
                            'sqlite:////home/myfolder/db.sqlite'): non-absolute paths will be relative to
                            the config file they are written in. If non-sqlite, the syntax is:
                            dialect+driver://username:password@host:port/database E.g.:
                            'postgresql://smith:Hw_6,@mymachine.example.org/mydb' (for info see:
                            http://docs.sqlalchemy.org/en/latest/core/engines.html#database-urls). It can
                            also be the path of a yaml file containing the property 'dburl' (e.g., the
                            config file used for downloading)  [required]
  -c, --configfile FILE     Optional: The path to the configuration file in yaml format
                            (https://learn.getgrav.org/advanced/yaml).
  -p, --pyfile FILE         Optional: The path to the python file with the plot functions implemented
  --help                    Show this message and exit.

Stream2segment can also visualize processed (or raw) data and metadata in a GUI (remotely as a web portal or locally in the web browser) with any kind of user-implemented plot. The visualization subroutine is closely related to the processing, as in many cases a user might want to use it to debug or inspect the processing results. Therefore, the GUI can be opened via the relative command (see the command details above) with the same arguments used for processing.

In this case, the processing module defines the plots to be visualized via a decorator @gui.plot attachable to any user-defined function, and an optional pre-processing function via the decorator @gui.preprocess. Both functions share the same signature as the main function defined above and should return data displayable as a plot (e.g., numeric arrays).
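
A minimal sketch of two such functions (the import of gui and the configuration keys used below are assumptions made for illustration; the generated templates show the exact import line and working examples):

# assumption: `gui` is imported as shown in the generated templates
@gui.preprocess
def apply_bandpass(segment, config):
    # illustrative pre-processing: return a band-pass filtered copy of the stream
    # ('freq_min' and 'freq_max' are hypothetical keys of the processing config)
    stream = segment.stream().copy()
    stream.filter("bandpass", freqmin=config["freq_min"], freqmax=config["freq_max"])
    return stream

@gui.plot
def cumulative_energy(segment, config):
    # illustrative custom plot: cumulative sum of squared amplitudes
    import numpy as np
    trace = segment.stream()[0]
    return np.cumsum(trace.data.astype(float) ** 2)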

The processing config also defines several parameters customizing the GUI: e.g., the segment's signal and noise windows, or the labels (displayed via check boxes) for annotating segments in the framework of supervised machine learning applications, or simply for assigning categories which are saved to the database and can then be used to select subsets of segments for processing.

Utilities

Finally, stream2segment implements several utilities whose commands are detailed below.

Command details
s2s utils [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  dreport   Returns download information for inspection in either plain text
            or html format. For details, type s2s utils dreport --help
  dstats    Produces download summary statistics in either plain text or html
            format. For details, type s2s utils dstats --help
  mathinfo  Prints on screen quick help on stream2segment built-in math
            functions. For details, type s2s utils mathinfo --help
  ttcreate  Creates a travel time table for computing travel times (via linear
            or cubic interpolation, or nearest point) in a *much* faster way
            than using obspy routines directly for large number of points.
            For details, type s2s utils ttcreate --help