OpenAPI is increasingly recognized as the preferred format among API developers for describing what their APIs offer. These specifications are commonly hosted on platforms like GitHub, GitLab, and SwaggerHub, or are directly available on the service provider’s website.
We focused on extracting API specifications from GitHub and SwaggerHub. Each platform presents unique challenges. On GitHub, the primary difficulty lies in sifting through millions of YAML and JSON files to pinpoint those that are OpenAPI specifications. Additionally, GitHub’s API enforces stringent rate limits, so queries must be paced slowly to stay within them. Despite these hurdles, we have collected approximately one million API specifications, covering the histories of over 200,000 specifications. SwaggerHub, on the other hand, does not offer a dedicated API for retrieving specifications. We worked around this by calling the existing APIs with varied parameters to systematically extract data, amassing over 500,000 specifications to date.
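As an illustration of the filtering problem described above, the following minimal sketch checks whether a file's content looks like an OpenAPI specification. It is an assumption-laden heuristic, not the filter actually used to build the dataset: the function name `detect_openapi_version` is hypothetical, only JSON input is handled (YAML would need a third-party parser), and the rule "a top-level `openapi` or `swagger` version string plus an `info` object" is one plausible criterion among several.

```python
import json
from typing import Any, Optional


def detect_openapi_version(text: str) -> Optional[str]:
    """Return the declared version if `text` looks like an OpenAPI or
    Swagger document serialized as JSON, otherwise None.

    Hypothetical heuristic: a top-level "openapi" (3.x) or "swagger" (2.0)
    string, together with an "info" object, which both versions require.
    """
    try:
        doc: Any = json.loads(text)
    except ValueError:
        # Not valid JSON (json.JSONDecodeError is a subclass of ValueError).
        return None
    if not isinstance(doc, dict) or not isinstance(doc.get("info"), dict):
        return None
    for key in ("openapi", "swagger"):
        version = doc.get(key)
        if isinstance(version, str):
            return version
    return None
```

A cheap check of this kind could run before the more expensive parsing and validation steps, so that unrelated JSON files such as package manifests are discarded early.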
To extend our dataset, we also sourced specifications from APIs.guru’s curated dataset and the GitHub archive via BigQuery. In total, we have gathered just over 1.4 million OpenAPI specifications, forming a comprehensive library of API documentation that is continuously updated with newly fetched specifications.
Our processing pipeline operates in parallel across both existing and newly acquired files. First, in the parsing phase, we check whether each file can be parsed. A file that parses successfully moves to the bundling phase, which resolves all external references. These references may be either relative or absolute paths. Relative paths are resolved only if the specification originates from GitHub. For absolute paths, there is a risk of waiting indefinitely for responses from certain servers; therefore, we have established a timeout period for fetching these dependencies. Once the file is self-contained, it is validated against the version of the OpenAPI Specification in which it is written.
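The bundling step described above can be sketched roughly as follows. This is a simplified illustration, not the pipeline's actual code: the helper names (`iter_refs`, `classify_ref`, `fetch_absolute`) and the timeout value are hypothetical, and the sketch assumes the specification has already been parsed into Python dictionaries and lists.

```python
from typing import Any, Iterator
from urllib.parse import urlparse
from urllib.request import urlopen

# Hypothetical value; the actual timeout used by the pipeline is not stated.
FETCH_TIMEOUT_SECONDS = 10


def iter_refs(node: Any) -> Iterator[str]:
    """Yield every string-valued "$ref" found anywhere in a parsed document."""
    if isinstance(node, dict):
        ref = node.get("$ref")
        if isinstance(ref, str):
            yield ref
        for value in node.values():
            yield from iter_refs(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_refs(item)


def classify_ref(ref: str) -> str:
    """Classify a reference as 'internal', 'absolute', or 'relative'."""
    if ref.startswith("#"):
        return "internal"  # points inside the same document, nothing to fetch
    if urlparse(ref).scheme in ("http", "https"):
        return "absolute"  # must be fetched over the network
    return "relative"      # only resolvable when the source repository is known


def fetch_absolute(ref: str) -> bytes:
    """Fetch the document behind an absolute reference, with a timeout.

    The timeout bounds each blocking network operation, so an unresponsive
    server raises an error instead of stalling the pipeline.
    """
    url = ref.split("#", 1)[0]  # drop the JSON Pointer fragment
    with urlopen(url, timeout=FETCH_TIMEOUT_SECONDS) as response:
        return response.read()
```

Separating classification from fetching keeps the network access in one place, which makes it easier to apply the timeout consistently and to skip relative references for specifications that do not come from GitHub.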
The processing pipeline concludes by running the files through a metrics calculator, which computes a set of metrics from each OpenAPI specification. These metrics describe four specific aspects of an API.
We expose the dataset through a dashboard available at: http://openapi.inf.usi.ch/
You can find the slides from my MSR 2024 presentation here: https://usi365-my.sharepoint.com/:p:/g/personal/serbos_usi_ch/EUlt8pg3qo5FgVPx5ftInBUBjIco_dJU32T4CN4mk18XjA?rtime=IfzOURZe3Eg
Paper: https://souhaila-serbout.me/pdfs/MSR2024-Serbout-Pautasso-APIstic.pdf