API request to access all congressional reports by the particular committee #155

Open
zymbuzz opened this issue May 8, 2024 · 7 comments

@zymbuzz

zymbuzz commented May 8, 2024

Hi, thanks a lot for maintaining and developing the API.

I am currently working through the documentation on accessing resources via the API, and I need help implementing one particular request with the current API functionality. Specifically, I would like to access all the metadata and the text files of congressional reports by the House Committee on Ways and Means.

My first approach was to use the link service, but I could not filter by committee. Alternatively, I could rely on a search via the API, where I could select all the documents from the committee. However, I wonder if the search is too reliable. The last possibility is to rely on the Congress API, which is more flexible but, to my understanding, covers fewer sources.

I would appreciate your guidance on how to access all congressional reports by the committee via the API.

@zymbuzz zymbuzz changed the title API request to access API request to access all congressional reports by the particular commitee May 8, 2024
@zymbuzz zymbuzz changed the title API request to access all congressional reports by the particular commitee API request to access all congressional reports by the particular committee May 8, 2024
@jonquandt
Member

jonquandt commented May 8, 2024

I would recommend using our search service. I'm not sure what you mean by:

"However, I wonder if the search is too reliable."

If our parsing has identified a report as coming from a particular committee, a search service request will return it.

Here is a curl command that should return the results you are after:

curl -X 'POST' \
  'https://api.govinfo.gov/search' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "query": "collection:crpt committee:(ways and means)",
  "pageSize": 10,
  "offsetMark": "*",
  "sorts": [
    {
      "field": "relevancy",
      "sortOrder": "DESC"
    }
  ],
  "historical": true,
  "resultLevel": "default"
}'

It will return a set of results that look like this:

{
  "results": [
    {
      "title": "EXTENDING LIMITS OF U.S. CUSTOMS WATERS ACT",
      "packageId": "CRPT-118hrpt436",
      "granuleId": "CRPT-118hrpt436-pt2",
      "lastModified": "2024-04-08T03:50:59Z",
      "governmentAuthor": [
        "Congress",
        "House of Representatives"
      ],
      "dateIssued": "2024-04-02",
      "collectionCode": "CRPT",
      "resultLink": "https://api.govinfo.gov/packages/CRPT-118hrpt436/granules/CRPT-118hrpt436-pt2/summary",
      "dateIngested": "2024-04-07",
      "download": {
        "premisLink": "https://api.govinfo.gov/packages/CRPT-118hrpt436/premis",
        "txtLink": "https://api.govinfo.gov/packages/CRPT-118hrpt436/granules/CRPT-118hrpt436-pt2/htm",
        "zipLink": "https://api.govinfo.gov/packages/CRPT-118hrpt436/zip",
        "modsLink": "https://api.govinfo.gov/packages/CRPT-118hrpt436/granules/CRPT-118hrpt436-pt2/mods",
        "pdfLink": "https://api.govinfo.gov/packages/CRPT-118hrpt436/granules/CRPT-118hrpt436-pt2/pdf"
      },
      "relatedLink": null
    },
    {
      "title": "EXTENDING LIMITS OF U.S. CUSTOMS WATERS ACT",
      "packageId": "CRPT-118hrpt436",
      "granuleId": "CRPT-118hrpt436",
      "lastModified": "2024-04-08T03:50:59Z",
      "governmentAuthor": [
        "Congress",
        "House of Representatives"
      ],
      "dateIssued": "2024-04-02",
      "collectionCode": "CRPT",
      "resultLink": "https://api.govinfo.gov/packages/CRPT-118hrpt436/granules/CRPT-118hrpt436/summary",
      "dateIngested": "2024-04-07",
      "download": {
        "premisLink": "https://api.govinfo.gov/packages/CRPT-118hrpt436/premis",
        "txtLink": "https://api.govinfo.gov/packages/CRPT-118hrpt436/granules/CRPT-118hrpt436/htm",
        "zipLink": "https://api.govinfo.gov/packages/CRPT-118hrpt436/zip",
        "modsLink": "https://api.govinfo.gov/packages/CRPT-118hrpt436/granules/CRPT-118hrpt436/mods",
        "pdfLink": "https://api.govinfo.gov/packages/CRPT-118hrpt436/granules/CRPT-118hrpt436/pdf"
      },
      "relatedLink": null
    },
....
  ],
  "offsetMark": "AoJw4JCJiY0DPwBDUlBULTExOGhycHQzNDctQ1JQVC0xMThocnB0MzQ3",
  "count": 785
}

As you can see, the results include direct links to the different download options.

To get the next set of results, submit the same search request with offsetMark set to the value returned in the previous response.
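The offsetMark paging described above can be sketched in Python as follows. This is a minimal illustration, not an official client: the helper names (`build_payload`, `post_search`, `iter_results`) are hypothetical, passing the key as an `api_key` query parameter is an assumption, and so is the stopping rule (an empty page or a mark that stops advancing).

```python
import json
import urllib.request

SEARCH_URL = "https://api.govinfo.gov/search"


def build_payload(query, offset_mark="*", page_size=10):
    """Build the same request body as the curl example, with a given offsetMark."""
    return {
        "query": query,
        "pageSize": page_size,
        "offsetMark": offset_mark,
        "sorts": [{"field": "relevancy", "sortOrder": "DESC"}],
        "historical": True,
        "resultLevel": "default",
    }


def post_search(payload, api_key):
    """POST one search request and return the decoded JSON response.

    Assumption: the API key is accepted as an `api_key` query parameter.
    """
    request = urllib.request.Request(
        f"{SEARCH_URL}?api_key={api_key}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def iter_results(query, fetch, page_size=10):
    """Yield every result, feeding each response's offsetMark into the next request.

    `fetch` is any callable that takes a payload dict and returns a response dict,
    which keeps the paging logic separate from the network call.
    """
    offset_mark = "*"
    while True:
        page = fetch(build_payload(query, offset_mark, page_size))
        results = page.get("results") or []
        yield from results
        next_mark = page.get("offsetMark")
        # Assumed end-of-results signals: an empty page or an unchanged mark.
        if not results or not next_mark or next_mark == offset_mark:
            break
        offset_mark = next_mark
```

A call such as `iter_results("collection:crpt committee:(ways and means)", lambda p: post_search(p, "YOUR_API_KEY"))` would then walk through all pages one result at a time.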

This is the equivalent search in the GovInfo UI.

Note that you could also use hswm00 in the committee parameter, as that is the committee's authority ID.
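As a small sketch, the query string could be built for either form of the committee value. The `committee_query` helper is hypothetical, and passing a single token such as the authority ID without parentheses is an assumption about the query syntax rather than something confirmed in this thread.

```python
def committee_query(collection, committee):
    """Return a search query restricted to one collection and one committee."""
    committee = committee.strip()
    # Multi-word names are grouped in parentheses, as in the curl example.
    # A single token (for example an authority ID) is passed through unchanged,
    # which is an assumption about the accepted syntax.
    term = f"({committee})" if " " in committee else committee
    return f"collection:{collection} committee:{term}"
```

For example, `committee_query("crpt", "ways and means")` reproduces the query from the curl command, and `committee_query("crpt", "hswm00")` gives the authority ID variant.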

@jonquandt jonquandt self-assigned this May 8, 2024
@zymbuzz
Author

zymbuzz commented May 8, 2024

Thanks a lot for the explanation.

Regarding the "unreliability", I was unsure whether I understood exactly what the search does and how much I could rely on it over the long term. To my understanding, the search is currently in beta. Most importantly, I was unsure whether the search filters across the whole universe of documents or looks for something close to the query.

To illustrate the second point: in your equivalent search in the GovInfo UI, on the left-hand side, one can refine the search by selecting the collection to be either congressional reports or the congressional serial set. I would have expected the search request to exclude the congressional serial set.

Could you also explain where I can specify the committee parameter, and where I can find the mnemonics for other committees?

@jonquandt
Member

The beta label indicates that the search service is still an early release. Functionally, it is production quality, but the interface could still change. We'll be removing the label in the near future, likely in our June release. More information about the search service can be found in this overview article.

There are some Congressional Serial Set documents that are also coded as Congressional Reports. For example, H. Rept. 94-1266 is a Serial Set package that is a Congressional Report.

For more information on document types available within the Serial Set, see https://www.govinfo.gov/help/serial-set#types on the Congressional Serial Set help page.

The various collection help pages list field operators/parameters that you can use to make a search more specific. Here is a list of specific metadata values for the Congressional Reports collection.

In this case, committee will search against the congCommittee element in MODS, which allows for searching by name or authorityId.

In the future, we may consider developing a more comprehensive/across collections list of parameters that can be referenced.

@zymbuzz
Author

zymbuzz commented May 27, 2024

Thanks a lot for getting back to me. I managed to set up the search following your instructions.

I plan to import all the documents associated with the congressional reports mentioned above. Is using the API the right way, or is it better to rely on the bulk data import?

Another question concerns the time span available. I noticed the data is only available from 1995, with some selected documents from earlier periods. So, I wonder if I should rely on the Congress API to get the earlier documents. Does GovInfo use the Congress API to import information? Are there any noticeable differences between the two APIs?

Thanks again for your help.

@jonquandt
Member

Yes, the API is the appropriate path for this. You can either grab contents from the search service download links directly or go via the resultLink to get additional information about the documents. Note that the zipLink will contain all content and metadata files for the entire package.
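Since the zipLink covers the entire package, and several results can belong to the same package (as with the two CRPT-118hrpt436 results in the sample response), one way to plan the downloads is to keep a single zipLink per packageId. The sketch below assumes results shaped like that sample response; `package_zip_links` is a hypothetical helper name.

```python
def package_zip_links(results):
    """Return {packageId: zipLink}, keeping one entry per package."""
    links = {}
    for result in results:
        # "download" may be missing or null, so fall back to an empty dict.
        zip_link = (result.get("download") or {}).get("zipLink")
        if zip_link:
            # Granules of the same package share one zipLink; keep the first seen.
            links.setdefault(result["packageId"], zip_link)
    return links
```

Each link can then be fetched in turn. Whether the download endpoints need the same API key as the search request is not stated here, so treat that as something to check.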

Congress.gov imports Congressional Reports from GovInfo using the GovInfo API, so you shouldn't find any Congressional Reports there that you don't see on GovInfo. GovInfo uses the Congress.gov API to create the bulk data BILLSUM and BILLSTATUS XML and to retrieve some authority information for committees and individual Members. The authority information is used as part of GovInfo parsing to provide richer metadata for search and access purposes.

Generally speaking, most of the congressional content (Record, Calendars, Bill text, Law text, congressional documents) available on Congress.gov is pulled from the GovInfo API. There are certainly implementation differences between the two, but a high-level difference would be the focus. Congress.gov focuses on legislative materials, primarily for the needs of Congress, while GovInfo acts as a preservation and access repository for official Government publications from all three branches. In addition to legislative publications, GovInfo publishes executive publications, like the Federal Register, Code of Federal Regulations, Daily Compilation of Presidential Documents, and other executive agency publications. GovInfo also publishes opinions from the Administrative Office of the U.S. Courts, covering a large number of federal district, bankruptcy, and appellate courts. You may want to see what's available on our help pages for additional examples.

The scope of available Congressional Reports is based on Public Law 103-40, which expanded GPO's mission to provide electronic access to Federal Government information.

We continue to make new reports available and GPO also works to increase the historical scope of a number of collections via digitization of physical publications.

@zymbuzz
Author

zymbuzz commented Jun 10, 2024

Thanks a lot for your answer. It clarified a lot.

Could you suggest whom I could contact to find some historical documents? I am interested in the historical reports by the Ways and Means Committee. Some historical documents from some committees are readily available via their websites.

@jonquandt
Member

jonquandt commented Oct 9, 2024

@zymbuzz - I apologize for missing your follow-up. We also have some historic digitized documents available in the Congressional Serial Set collection, including from the Ways and Means committee.

Here is a sample search on the GovInfo website. I will note that SERIALSET packages identified as Congressional Reports will also show up in the original API search I provided. You can recognize them by a packageId that includes SERIALSET, or by a collectionCode of "SERIALSET;CRPT".

The Congressional Serial Set is in the process of digitization and ingest into GovInfo, with many more volumes remaining. Note: Congressional Serial Set packages do not have native text associated with them, though you could look at extracting the OCRed text from the PDFs that are available.
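The packageId and collectionCode check described above can be sketched as a small filter over search results, which is one way to separate the packages that would need OCR text extraction from those with native text. `is_serial_set` and `split_by_source` are hypothetical helper names, and splitting collectionCode on ";" assumes the "SERIALSET;CRPT" form shown above.

```python
def is_serial_set(result):
    """Return True if a search result looks like a Congressional Serial Set package."""
    package_id = result.get("packageId") or ""
    # Assumption: multiple collections are joined with ";", as in "SERIALSET;CRPT".
    codes = (result.get("collectionCode") or "").split(";")
    return "SERIALSET" in package_id or "SERIALSET" in codes


def split_by_source(results):
    """Split results into (serial_set, other) lists, preserving order."""
    serial_set = [r for r in results if is_serial_set(r)]
    other = [r for r in results if not is_serial_set(r)]
    return serial_set, other
```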
