Stream2segment is a program to download, process and visualize event-based seismic waveform data (segments). It is particularly suited to, and optimized for, managing huge amounts of relatively short segments.
Stream2segment is based on template files. Hereafter, we will refer to:
- download config (file name convention: download.yaml): the file defining the parameters of the download subroutine, including the URL of the database where data will be saved. Currently supported are PostgreSQL (installable locally or remotely) and SQLite, which stores all data in a single portable local file
- processing module (file name convention: processing.py): the Python file where the user implements the code of the processing subroutine for a single segment and, optionally, the functions displaying user-defined plots in the GUI
- processing config (file name convention: processing.yaml): the (optional) configuration file of the processing module, defining all parameters to be passed to it (including which segments to process)
All configuration files are in YAML syntax, a human-readable data serialization language. Both the download and the processing subroutines feature a logging system printing relevant information to file or database, and both can be gracefully stopped at any moment with CTRL+C (if a download is interrupted, a further run with the same parameter set will start from the beginning, but segments downloaded before the interruption will not be downloaded again). Finally, both subroutines display a progress bar with the estimated remaining time (more reliable during processing than during download, but nevertheless very useful as a rough estimate in the latter case too). In executions potentially lasting days, these features are particularly useful.
Stream2segment is a command line application invokable by opening a terminal and typing s2s
or stream2segment
followed by a command denoting a specific task (s2s download
, s2s process
and so on). The available commands are described below.
Command details
s2s init [OPTIONS] OUTDIR
Creates template files for launching download, processing and visualization. OUTDIR will be created if
it does not exist
Options:
--help Show this message and exit.
The initial step is to create the template files in a specified directory. The s2s init command (detailed above) generates one download config
and two processing module
s (with their processing config
s) covering two common processing cases (in future versions, we plan to add more processing files to widen the use-case coverage). Any implementation detail not mentioned in this section is provided in the template files.
Command details
s2s download [OPTIONS]
Downloads waveform data segments with metadata in a specified database. The -c option (required) sets
the defaults for all other options below, **which are optional**
Options:
-c, --config FILE The path to the configuration file in yaml format
(https://learn.getgrav.org/advanced/yaml). [required]
-d, --dburl TEXT Database url where to save data (currently supported are sqlite and
postgresql. If postgres, the database must have been created beforehand).
If sqlite, just write the path to your local file prefixed with
'sqlite:///' (e.g., 'sqlite:////home/myfolder/db.sqlite'): non-absolute
paths will be relative to the config file they are written in. If non-
sqlite, the syntax is:
dialect+driver://username:password@host:port/database E.g.:
'postgresql://smith:Hw_6,@mymachine.example.org/mydb' (for info see:
http://docs.sqlalchemy.org/en/latest/core/engines.html#database-urls)
-es, --eventws TEXT The event web service url to use. Supply a *full* url (up to and not
including the first query character '?') or a path to a local file. The
events list returned by the url or in the supplied file must be formatted
as specified in https://www.fdsn.org/webservices/FDSN-WS-
Specifications-1.1.pdf#page=16 or as isf
(http://www.isc.ac.uk/standards/isf/download/isf.pdf), although the latter
has limited support in this program (e.g., comments are not allowed. Use
at your own risk). You can also type one of the following shortcut
strings: {{ DOWNLOAD_EVENTWS_LIST }}
-s, --start, --starttime DATE or DATETIME
Limit to events (and datacenters) on or after the specified start time.
Specify a date or date-time in iso-format or an integer >=0 to denote the
number of days before today at midnight. Example: start=1 and end=0 =>
fetch events occurred yesterday.
-e, --end, --endtime DATE or DATETIME
Limit to events (and datacenters) on or before the specified end time.
Specify a date or date-time in iso-format or an integer >=0 to denote the
number of days before today at midnight. Example: start=1 and end=0 =>
fetch events occurred yesterday.
-n, --network, --networks, --net TEXT
Limit the search to the specified networks (see 'channel' parameter for
details).
-z, --station, --stations, --sta TEXT
Limit the search to the specified stations (see 'channel' parameter for
details).
-l, --location, --locations, --loc TEXT
Limit the search to the specified locations (see 'channel' parameter for
details).
-k, --channel, --channels, --cha TEXT
Limit the search to the specified channels (if missing, defaults to '*',
i.e.: accept all channels) Wildcards '?' and '*' are recognized
(https://www.fdsn.org/webservices/FDSN-WS-Specifications-1.1.pdf), as well
as the operator '!' placed as first character to indicate logical NOT.
Example: "!B*,BBB" accepts all channels NOT starting with "B" OR the
channel "BBB"
-msr, --min-sample-rate FLOAT Limit the search to channels with at least the following sample rate (in
Hz). The relative segments will *most likely* (but not always) match the
channel sample rate. Set to 0 or negative number to ignore the sampling
rate
-ds, --dataws TEXT data-select web service to use (url). It *must* be FDSN compliant:
<site>/fdsnws/dataselect/<majorversion>/query otherwise the station query
can not be retrieved automatically (the site scheme is optional and will
default to 'http://' in case. An ending '/' or '?' will be removed from
the url, if present). You can also type two special values: "iris"
(shortcut for: https://service.iris.edu/fdsnws/dataselect/1/query) or
"eida" (which will automatically fetch data from the urls of all EIDA
datacenters).
-t, --traveltimes-model TEXT The model to be used to assess the travel times of a wave from the event
location to each station location. Type a string denoting a file name
(absolute path) of a custom model created by means of `s2s utils ttcreate`
or one of the 4 built-in models (all assuming receiver depth=0 for
simplicity): ak135_ttp+: ak135 model pre-computed for all ttp+ phases (P
wave arrivals) ak135_tts+: ak135 model pre-computed for all tts+ phases (S
wave arrivals) iasp91_ttp+: iasp91 model pre-computed for all ttp+ phases
(P wave arrivals) iasp91_tts+: iasp91 model pre-computed for all tts+
phases (S wave arrivals) For each segment, the arrival time (travel time +
event time) will be the pivot whereby the user sets up the download time
window (see also 'timespan').
-w, --timespan FLOAT... The segment's time span (i.e., the data time window to download): specify
two positive floats denoting the minutes to account for before and after
the calculated arrival time. Note that 3.5 means 3 minutes 30 seconds, and
that each segment window will be eventually rounded to the nearest second
to avoid floating point errors when checking for segments to re-download
because of a changed window.
-u, --update-metadata Update segments metadata, i.e. overwrite the data of already saved
stations and channels. Metadata include the station inventories (see
'inventory' for details). This parameter does not affect new stations and
channels, which will be saved on the db anyway
-r1, --retry-url-err Try to download again already saved segments with no waveform data
because of a general url error (e.g., no internet connection, timeout,
...)
-r2, --retry-mseed-err Try to download again already saved segments with no waveform data
because the response was malformed, i.e. not readable as MiniSeed
-r3, --retry-seg-not-found Try to download again already saved segments with no waveform data
because not found in the response. This is NOT the case when the server
returns no data with an appropriate 'No Content' message, but when a
successful response (usually '200: OK') does not contain the expected
segment data. E.g., a multi-segment request returns some but not all
requested segments.
-r4, --retry-client-err Try to download again already saved segments with no waveform data
because of a client error (response code in [400,499])
-r5, --retry-server-err Try to download again already saved segments with no waveform data
because of a server error (response code in [500,599])
-r6, --retry-timespan-err Try to download again already saved segments with no waveform data because
the response data was completely outside the requested time span (see
'timespan' for details)
-i, --inventory [true|false|only]
Download station inventories (xml format). Inventories will be downloaded
and saved on the db for all stations that have saved segments with data.
If the metadata should not be updated (see 'update_metadata') already
saved inventories will not be downloaded again. You can always download
inventories later by providing "only" as value (without quotes): this will
skip all other download steps (and ignore all other parameters values
except 'update_metadata')
-minlat, --minlatitude FLOAT (eventws query argument) Limit to events with a latitude larger than or
equal to the specified minimum
-maxlat, --maxlatitude FLOAT (eventws query argument) Limit to events with a latitude smaller than or
equal to the specified maximum
-minlon, --minlongitude FLOAT (eventws query argument) Limit to events with a longitude larger than or
equal to the specified minimum
-maxlon, --maxlongitude FLOAT (eventws query argument) Limit to events with a longitude smaller than or
equal to the specified maximum
--mindepth FLOAT (eventws query argument) Limit to events with depth more than the
specified minimum
--maxdepth FLOAT (eventws query argument) Limit to events with depth less than the
specified maximum
-minmag, --minmagnitude FLOAT (eventws query argument) Limit to events with a magnitude larger than the
specified minimum
-maxmag, --maxmagnitude FLOAT (eventws query argument) Limit to events with a magnitude smaller than the
specified maximum
--help Show this message and exit.
The download routine is started by editing the parameters of the download config
and running the corresponding command. The routine fetches the requested events and then searches for the available stations and channels (in case of network error, the data is fetched from the database, if present): the corresponding parameter accepts FDSN-compliant URLs or the special words “eida” or “iris”. In the former case, since EIDA is a federation of data centers, a so-called Routing Service is used, as in most download tools, to fetch the URLs of all available data centers and purge potentially duplicated stations returned by more than one data center. The channel search can be tuned with a parameter controlling the minimum sampling rate and with constraint parameters (network, station, location and channel), which additionally accept the leading character “!” to denote exclusion (e.g. “!A*”). Note that a station is internally identified by the unique tuple (network code, station code, start time), meaning that the same physical station, closed and reopened later, is saved as two different station entities in the database. This also allows handling their inventories (which might differ) separately. Events, stations and channels are saved to the database: already existing events are never overridden, whereas overriding existing stations and channels is configurable (by default they are not overridden).
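The uniqueness rule above can be illustrated with a small sketch (all station codes and dates below are made up for illustration):

```python
# A station is keyed by the tuple (network code, station code, start time):
# the same physical station reopened later is a distinct entity, whereas the
# same station returned by two data centers is a duplicate to be purged.
stations_from_datacenters = [
    ("GE", "APE", "1999-01-01"),  # first epoch of the station
    ("GE", "APE", "2010-05-12"),  # same codes, reopened later: new entity
    ("GE", "APE", "1999-01-01"),  # duplicate from a second data center
]
unique_stations = set(stations_from_datacenters)
len(unique_stations)  # 2: the reopened station is kept, the duplicate purged
```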
Given the lists of events and stations, for each event epicentre the program iteratively finds the nearby stations within a circular area whose configurable radius can be constant or magnitude-dependent: this results in a list of potential segments to be downloaded. For each of these segments, the time at which the event's waves reach the segment's station (the arrival time) is efficiently calculated by interpolation on a regular grid of distances and source depths with pre-computed (by means of ObsPy functions) travel times. The grid can be created with the command s2s utils ttcreate by setting model name, phases and a maximum error tolerance, or configured by supplying the name of one of four pre-computed grids. With the configurable (arrival-time-dependent) time window, all the information needed to download the waveform data segments from the relevant data center URLs is now available. Given the amount of data to fetch compared to station and event requests, this is the most demanding step. Therefore, stream2segment packs together all segments belonging to the same time window and data center and queries them at once, reducing the number of connections, and runs each request in a parallel thread to further optimize blocking IO operations.
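As a rough sketch of the two steps just described, travel-time interpolation on a pre-computed grid followed by the arrival-time-based download window, consider the following. All grid values, times and parameter values are made up for illustration; this is not the actual Stream2segment implementation:

```python
from bisect import bisect_right
from datetime import datetime, timedelta

def interp_travel_time(distances, depths, times, dist, depth):
    """Bilinear interpolation on a pre-computed travel-time grid:
    times[i][j] is the travel time (s) at distances[i], depths[j]."""
    i = max(0, min(bisect_right(distances, dist) - 1, len(distances) - 2))
    j = max(0, min(bisect_right(depths, depth) - 1, len(depths) - 2))
    tx = (dist - distances[i]) / (distances[i + 1] - distances[i])
    ty = (depth - depths[j]) / (depths[j + 1] - depths[j])
    return (times[i][j] * (1 - tx) * (1 - ty)
            + times[i + 1][j] * tx * (1 - ty)
            + times[i][j + 1] * (1 - tx) * ty
            + times[i + 1][j + 1] * tx * ty)

# Toy grid where travel time grows linearly with distance and depth:
distances, depths = [0.0, 10.0, 20.0], [0.0, 100.0]
times = [[0.0, 5.0], [100.0, 105.0], [200.0, 205.0]]
travel_time = interp_travel_time(distances, depths, times, 5.0, 50.0)  # 52.5 s

# Arrival time = event time + travel time; the 'timespan' parameter
# (minutes before and after the arrival time) defines the window to download:
event_time = datetime(2019, 1, 1, 12, 0, 0)
arrival_time = event_time + timedelta(seconds=travel_time)
timespan = (1.5, 3.5)  # 1 min 30 s before, 3 min 30 s after
window = (arrival_time - timedelta(minutes=timespan[0]),
          arrival_time + timedelta(minutes=timespan[1]))
```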
Even more important in this stage is a detailed tracking of the download results: massive downloads should generally be performed more than once with the same parameter set, as network problems are likely to happen and might be solved in a further download. For each requested waveform, stream2segment saves a code denoting the download state, either issued by the server (data center) or obtained by inspecting all successfully received waveforms by means of an efficient diagnostic module originally implemented for SeisComP3. This results in a broad and configurable spectrum of cases whereby it is possible to retry the download of already saved segments. The diagnostic also returns several pieces of information held in the waveform header, which an RDBMS can efficiently store to perform powerful segment selections in the processing phase: maxgap numsamples, data seed id, sample rate (which might differ from the one stored in the associated station's channel), start time and end time. The last two parameters are particularly important, as the time window of the received segment data might not match the requested time window (RTW). All miniSEED records (chunks of data) entirely outside the RTW are discarded, avoiding saving useless data. The optional, configurable last step is the download of the station inventories (XML format), which will be saved (compressed, to optimize storage size) only for stations that have at least one downloaded segment with data. Depending on the configuration, already downloaded inventories will be skipped (the default) or overridden. Once downloaded, data can be processed by invoking the command:
Command details
s2s process [OPTIONS] [OUTFILE]
Processes downloaded waveform data segments via a custom python file and a configuration file.
[OUTFILE] (optional): the path of the .csv file where the output of the user-defined processing function
F will be written to (one row per processed segment); all logging information, errors or warnings will
be written to the file [OUTFILE].[now].log (where [now] denotes the current utc date-time in iso
format). If this argument is missing, then the output of F (if any) will be discarded, and all logging
messages will be saved to the file [pyfile].[now].log
Options:
-d, --dburl TEXT or PATH Database url where to save data (currently supported are sqlite and
postgresql. If postgres, the database must have been created beforehand). If
sqlite, just write the path to your local file prefixed with 'sqlite:///'
(e.g., 'sqlite:////home/myfolder/db.sqlite'): non-absolute paths will be
relative to the config file they are written in. If non-sqlite, the syntax
is: dialect+driver://username:password@host:port/database E.g.:
'postgresql://smith:Hw_6,@mymachine.example.org/mydb' (for info see:
http://docs.sqlalchemy.org/en/latest/core/engines.html#database-urls). It
can also be the path of a yaml file containing the property 'dburl' (e.g.,
the config file used for downloading) [required]
-c, --config FILE The path to the configuration file in yaml format
(https://learn.getgrav.org/advanced/yaml). [required]
-p, --pyfile FILE The path to the python file where the user-defined processing function is
implemented. The function will be called iteratively on each segment
selected in the config file [required]
-f, --funcname TEXT The name of the user-defined processing function in the given python file.
Optional: defaults to 'main' when missing
-a, --append Append results to the output file (this flag is ignored if no output file is
provided): 'append' means also that the program will first scan the output
file to detect already processed segments and skip them. When missing, it
defaults to false, meaning that an output file, if provided, will be
overridden if it exists
--no-prompt Do not prompt the user when attempting to overwrite an existing output file.
This flag is false by default, i.e. the user will be asked for confirmation
before overwriting an existing file. This flag is ignored if no output file
is provided, or the 'append' flag is given
-mp, --multi-process Use parallel sub-processes to speed up the execution. When missing, it
defaults to false
-np, --num-processes INTEGER The number of sub-processes. If missing, it is set as the the number of CPUs
in the system. This option is ignored if --multi-process is not given
--help Show this message and exit.
Once downloaded, data can be processed by invoking the corresponding command (detailed above), which needs a user-defined processing config
and a processing module
, usually created by editing the template files.
The processing config
accepts any kind of parameter to configure the execution, including the selection of suitable segments via a parameter named segment_select
, which accepts a list of fields (the columns of each segment database row and related tables, e.g. channel, station, data center, event), each associated with a value given as a simple string expression, which the program converts into the corresponding SQL syntax.
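As a purely illustrative sketch of this conversion idea (this toy converter is not the actual Stream2segment implementation, and the attribute names and expressions below are made up):

```python
# Toy illustration: each segment_select entry maps a segment attribute to a
# simple string expression; the program turns them into SQL WHERE clauses.
def to_sql(attribute, expression):
    expression = expression.strip()
    for op in ("<=", ">=", "!=", "<", ">", "="):
        if expression.startswith(op):
            value = expression[len(op):].strip()
            return f"{attribute} {op} {value}"
    # No operator: plain equality
    return f"{attribute} = {expression}"

selection = {"event.magnitude": ">=4.5", "segment.has_data": "true"}
clauses = [to_sql(attr, expr) for attr, expr in selection.items()]
" AND ".join(clauses)  # "event.magnitude >= 4.5 AND segment.has_data = true"
```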
The processing module
is a Python module with a single mandatory function:
def main(segment, config)
which defines the code to be executed on each selected segment. If the optional output CSV
argument is provided, the function may return a list or dictionary of values, which will be written as a row of the CSV file. Nevertheless, there are no restrictions on what can be implemented: the idea of the processing module
is simply to free the user from unnecessary burden, and to provide the flexibility of an RDBMS with no SQL knowledge required. Everything implemented in the processing config
is exposed to the user via the config
argument in the form of a Python dictionary, whereas the segment
argument is a simple Python object representing the currently processed segment. Its attributes return all sort of segment's data (e.g. segment.arrival_time
, segment.max_duration_ratio
, segment.has_data
) including related database entities (e.g., segment.event
, segment.station
) in the form of other Python objects. Also provided are several methods (e.g., segment.stream()
, segment.inventory()
) returning objects for working with ObsPy.
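A minimal processing function might look like the sketch below. The attribute and method names used (segment.stream(), segment.event) are those mentioned above; the stub classes are hypothetical stand-ins, defined solely so that the sketch runs without a database, and in reality segment.stream() returns an ObsPy Stream, not a list:

```python
# Sketch of a user-defined processing function as it might appear in a
# processing module. The _Fake* classes are made-up stand-ins for the real
# database-backed segment object, here only to make the sketch runnable.

class _FakeTrace:
    def __init__(self, data):
        self.data = data

class _FakeEvent:
    def __init__(self, magnitude):
        self.magnitude = magnitude

class _FakeSegment:
    def __init__(self, data, magnitude):
        self._traces = [_FakeTrace(data)]
        self.event = _FakeEvent(magnitude)
    def stream(self):
        return self._traces  # stands in for an ObsPy Stream

def main(segment, config):
    """Called once per selected segment; the returned dict becomes a CSV row
    (keys are the column names) when an output file is provided."""
    trace = segment.stream()[0]
    peak = max(abs(v) for v in trace.data)
    return {"magnitude": segment.event.magnitude,
            "peak_amplitude": peak,
            "peak_ok": peak >= config["min_peak"]}

row = main(_FakeSegment([0.1, -2.5, 1.0], magnitude=4.2), {"min_peak": 1.0})
```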
A documented list of all these properties is provided in all templates generated with s2s init
. Stream2segment also implements several math utilities and functions, which can be listed and printed on screen via the command s2s utils mathinfo
. All functions complement the rich set of utilities of the ObsPy library (installed with the package). The user can also run the whole processing subroutine across several Python sub-processes to fully leverage multiple processors on a given machine.
Command details
s2s show [OPTIONS]
Shows raw and processed downloaded waveform's plots in a browser
Options:
-d, --dburl TEXT or PATH Database url where to save data (currently supported are sqlite and postgresql.
If postgres, the database must have been created beforehand). If sqlite, just
write the path to your local file prefixed with 'sqlite:///' (e.g.,
'sqlite:////home/myfolder/db.sqlite'): non-absolute paths will be relative to
the config file they are written in. If non-sqlite, the syntax is:
dialect+driver://username:password@host:port/database E.g.:
'postgresql://smith:Hw_6,@mymachine.example.org/mydb' (for info see:
http://docs.sqlalchemy.org/en/latest/core/engines.html#database-urls). It can
also be the path of a yaml file containing the property 'dburl' (e.g., the
config file used for downloading) [required]
-c, --configfile FILE Optional: The path to the configuration file in yaml format
(https://learn.getgrav.org/advanced/yaml).
-p, --pyfile FILE Optional: The path to the python file with the plot functions implemented
--help Show this message and exit.
Stream2segment can also visualize processed (or raw) data and metadata in a GUI (remotely as a web portal, or locally in the web browser) with any kind of user-implemented plots. The visualization subroutine is closely related to processing, as in many cases a user might want to use it to debug or inspect the processing results. Therefore, the GUI can be opened via the corresponding command (detailed above) with the same arguments used for processing.
In this case, the processing module
will define the plots to be visualized via a decorator @gui.plot
attachable to any user-defined function, and an optional pre-processing function via the decorator @gui.preprocess
. Both functions share the same signature as the main
function defined above and should return data displayable as a plot (e.g., numeric arrays).
The processing config
will define several parameters customizing the GUI: e.g., the segment's signal and noise windows, or the labels (displayed via check boxes) for annotating segments in the framework of supervised machine learning applications, or simply for assigning categories which will be saved to the database and can then be used to select segment subsets for processing.
Finally, stream2segment implements several utilities whose commands are detailed below.
Command details
s2s utils [OPTIONS] COMMAND [ARGS]...
Options:
--help Show this message and exit.
Commands:
dreport Returns download information for inspection in either plain text
or html format. For details, type s2s utils dreport --help
dstats Produces download summary statistics in either plain text or html
format. For details, type s2s utils dstats --help
mathinfo Prints on screen quick help on stream2segment built-in math
functions. For details, type s2s utils mathinfo --help
ttcreate Creates a travel time table for computing travel times (via linear
or cubic interpolation, or nearest point) in a *much* faster way
than using obspy routines directly for large number of points.
For details, type s2s utils ttcreate --help
This program is released under the GNU GENERAL PUBLIC LICENSE Version 3
- Research article: Riccardo Zaccarelli, Dino Bindi, Angelo Strollo, Javier Quinteros and Fabrice Cotton. Stream2segment: An Open-Source Tool for Downloading, Processing, and Visualizing Massive Event-Based Seismic Waveform Datasets. Seismological Research Letters (2019). https://doi.org/10.1785/0220180314
- Software: Zaccarelli, Riccardo (2018): Stream2segment: a tool to download, process and visualize event-based seismic waveform data. V. 2.7.3. GFZ Data Services. http://doi.org/10.5880/GFZ.2.4.2019.002