This library is meant to serve as a convenience library for using the PDS. It came about from me rewriting the same code over and over.
This is hosted in artifactory.huit. You can install it directly with pip:

```
pip install --index-url https://artifactory.huit.harvard.edu/artifactory/api/pypi/ats-python/simple pds
```
If you have the dependency in a requirements file, the `extra-index-url` needs to be set up in your pip config. Note that you should set `extra-index-url` and NOT `index-url`, as the latter will overwrite your default index (probably pypi.org).

```
pip config set global.extra-index-url https://artifactory.huit.harvard.edu/artifactory/api/pypi/ats-python/simple
pip install -r requirements.txt
```
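For reference, that `pip config` command just writes the value into your pip config file; the resulting entry should look something like this (the file's location varies by platform and scope):

```
[global]
extra-index-url = https://artifactory.huit.harvard.edu/artifactory/api/pypi/ats-python/simple
```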
While it's not recommended, you can also pull directly from source. If you're a member of github.huit, this can be installed locally, directly from this repository, with:

```
pip install git+https://github.huit.harvard.edu/HUIT/[email protected]
```

However, if this is being deployed, you can't do that. (A Docker image building this will most likely not be authenticated to github.huit the way you are locally.) This will eventually make it to artifactory, but right now it is being mirrored on public GitHub, so you can use this in your requirements.txt:

```
pds @ git+https://github.com/harvard-huit/[email protected]
```
Create a People object using an apikey and the batch size you want (if you're paginating).

```python
import pds

pds_api = pds.People(apikey='12345', batch_size=50)
```

The default batch size is 50, and this won't work without a valid apikey, so if you don't have one, head over here: https://portal.apis.huit.harvard.edu/docs/ats-person-v3/1/overview
Once that object is created, there are only a handful of methods on it.
Args:
- apikey (str): your apikey
- batch_size (int): the size of your batches. Defaults to 50. 1000 is the max allowed by the API.
- retries (int): the number of retries each request will have. Defaults to 3.

```python
import os
import pds

people = pds.People(apikey=os.getenv('APIKEY'), batch_size=1000)
```
People Class Attributes:
- `apikey` (str): The API key for accessing the PDS API.
- `batch_size` (int): The number of results to return per API call.
- `retries` (int): The number of times to retry an API call if it fails.
- `environment` (str): The environment to use for the PDS API (dev, test, stage, or prod).
- `is_paginating` (bool): Whether or not the API is currently paginating results.
- `pagination_type` (str): The type of pagination to use (queue or session).
- `result_queue` (queue.Queue): A queue for storing paginated results.
- `max_backlog` (int): The maximum number of results to return.
- `results` (list): A list of all results returned by the API.
- `count` (int): The number of results returned by the last API call.
- `total_count` (int): The total number of results available for the last API call.
- `session_id` (str): The ID of the current pagination session.
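As a quick illustration (a sketch, not from the library docs), a few of these attributes can be inspected after a call; this assumes `count` and `total_count` are populated once a search has run:

```python
import os
import pds

# Sketch: inspecting People attributes after a search.
# Assumption: count/total_count reflect the last call, per the list above.
people = pds.People(apikey=os.getenv('APIKEY'), batch_size=100)
people.search(query={
    "fields": ["univid"],
    "conditions": {"names.name": "jazahn"}
})
print(people.count)          # results returned by the last call
print(people.total_count)    # total results available for the query
print(people.is_paginating)  # False here, since we didn't paginate
```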
- `search`: the core search method
- `next`: the manual method to make a call for the next result in a pagination session
- `start_pagination`: create a pagination session to page through results
- `wait_for_pagination`: blocks processing until pagination is done. Synonymous with calling `start_pagination` with the `wait` parameter.
- `next_page_results`: gets the next batch of results
This is the core function. It takes a dict `query` and an optional boolean `paginate`, and returns a dict containing `total_results` and a list of `results`.
```python
import pds

pds_api = pds.People(apikey='12345', batch_size=50)

people = pds_api.search(query={
    "fields": [
        "univid"
    ],
    "conditions": {
        "names.name": "jazahn"
    }
})

total_results = people['total_results']
results = people['results']
```
This just converts search results into DotMaps, which can sometimes make referencing values easier.
```python
import pds

pds_api = pds.People(apikey='12345', batch_size=50)

people = pds_api.search(query={
    "fields": [
        "univid"
    ],
    "conditions": {
        "names.name": "jazahn"
    }
})

total_results = people['total_results']
results = people['results']

dotmapped_people = pds_api.make_people(results)
print(dotmapped_people[0].univid)
```
`next` is probably the reason you're using this library. This helps simplify pagination. Simply make a search call with the `pagination` boolean set to `True` and then you can call `next()` to get the next set.
```python
import pds

pds_api = pds.People(apikey='12345', batch_size=50)

response = pds_api.search(query={
    "fields": [
        "univid"
    ],
    "conditions": {
        "names.name": "jazahn"
    }
}, pagination=True)

total_results = response['total_results']
results = response['results']

while True:
    response = pds_api.next()
    if not response:
        break
    results = response['results']
    # do something with results
```
Starts a new pagination session.
Args:
- query (dict): The query to search for.
- type (str): The type of pagination to use (`queue` or `list`). Defaults to `queue`.
- wait (bool): Whether or not to wait for the pagination session to complete.
- max_backlog (int): The maximum number of results to store in memory. This is ignored if `wait` is `True` and defaults to 5000. (It will go past this in order to keep the pagination session alive, but the time in between requests slows down significantly.)

```python
people.start_pagination(query=query)
```

Or, waiting and collecting everything into a list:

```python
people.start_pagination(query=query, type='list', wait=True)
all_results = people.results
```
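As a sketch of how these arguments combine (the `max_backlog` behavior here is assumed from the description above): a `queue`-type session with a smaller backlog, drained via `next_page_results` (documented below) as you go:

```python
# Sketch: queue-type pagination with a reduced backlog.
# Assumption: draining batches promptly keeps the backlog under max_backlog,
# so the session doesn't slow down its requests.
people.start_pagination(query=query, type='queue', max_backlog=2000)
while True:
    batch = people.next_page_results()
    if len(batch) < 1 and not people.is_paginating:
        break
    # process the batch here
```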
Blocks the thread until pagination is finished and returns a boolean indicating whether there is more to do.
Returns:
- `bool`: `True` if there are results, `False` otherwise.

```python
people.start_pagination(query=query)
people.wait_for_pagination()
```

This is identical to:

```python
people.start_pagination(query=query, wait=True)
```
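If you want to use the boolean it returns (semantics assumed from the Returns note above), a minimal sketch:

```python
# Sketch: branching on wait_for_pagination's return value.
people.start_pagination(query=query, type='list')
if people.wait_for_pagination():
    # True: results were collected; with type='list' they land on people.results
    print(len(people.results))
else:
    print("no results")
```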
Returns the next batch of results from the API, based on the pagination type.

If `pagination_type` is 'queue', returns the next item in the `result_queue`. If `pagination_type` is 'list', returns the next batch of results from the `results` list.

If the pagination process is currently running, this method will block until it gets the next page result. If the method has been running for longer than `session_timeout` minutes, an error is logged and the method exits.

Returns:
- `list`: The next batch of results from the API, or an empty list if there are no more results.
```python
import logging
import os

import pds

logger = logging.getLogger(__name__)

people = pds.People(apikey=os.getenv('APIKEY'), batch_size=1000)

query = {
    "fields": ["univid"],
    "conditions": {
        "names.name": "john"
    }
}

try:
    people.start_pagination(query)
    while True:
        results = people.next_page_results()
        logger.info(f"doing something with this batch of {len(results)} results")
        if len(results) < 1 and not people.is_paginating:
            break
except Exception as e:
    logger.error(f"Something went wrong with the processing. {e}")
```
The pagination process can be used in a few ways: synchronously or asynchronously, producing either a queue of batches or a list.

Using `next()` is fine, but this pagination process was created with async operations in mind. For example, if you're processing 100k records, you'll be getting a max of 1000 records at a time, and if you're doing something with each batch that takes longer than 3 minutes, the PDS pagination session will time out before you get to call `next()` again, which would force you to start it over.
Please note that getting a lot of records and holding them before processing could have an effect on memory. If you have the memory to hold all the records you're getting, you can just do this:
```python
import os

import pds

people = pds.People(apikey=os.getenv('APIKEY'), batch_size=1000)

query = {
    "fields": ["univid"],
    "conditions": {
        "names.name": "john"
    }
}

people.start_pagination(query, type='list', wait=True)
people_list = people.results
```
That will give you the full list.
If you don't have the memory to spare, you can process batches as they come in instead, the same way as the `next_page_results` example above:

```python
import logging
import os

import pds

logger = logging.getLogger(__name__)

people = pds.People(apikey=os.getenv('APIKEY'), batch_size=1000)

query = {
    "fields": ["univid"],
    "conditions": {
        "names.name": "john"
    }
}

try:
    people.start_pagination(query)
    while True:
        results = people.next_page_results()
        logger.info(f"doing something with this batch of {len(results)} results")
        if len(results) < 1 and not people.is_paginating:
            break
except Exception as e:
    logger.error(f"Something went wrong with the processing. {e}")
```