We provide a brief introduction to the methods we use to ingest and process data, and to the technology involved in each step that makes this project possible.
There are three main workflow stages: identifying datasets in publications (process); giving agencies and researchers access to that information through APIs, Jupyter Notebooks, and researcher dashboards (access and disseminate); and gathering feedback from the user community (feedback).
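As a rough illustration of the access-and-disseminate stage, the Python sketch below shows how a researcher might query such an API from a Jupyter Notebook. The base URL, route, parameters, and response shape are all assumptions made for illustration; the project's actual API is not described in this overview.

```python
import requests

# Hypothetical base URL, route, parameters, and response shape: this
# overview does not specify the real API, so everything below is
# illustrative only.
API_BASE = "https://api.example.org/v1"

def publications_for_dataset(dataset_name: str, limit: int = 10) -> list[dict]:
    """Fetch publications that an (assumed) search endpoint links to a dataset."""
    response = requests.get(
        f"{API_BASE}/publications",
        params={"dataset": dataset_name, "limit": limit},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["results"]

# In a Jupyter Notebook cell, a researcher might then call:
# publications_for_dataset("National Health Interview Survey")
```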
This project is made possible by machine learning algorithms drawn from the winning models of a Kaggle competition. A team of expert reviewers validates the outputs to ensure that the results are accurate and reliable.
Machine Learning Algorithms
Three winning Kaggle models were developed using Machine Learning (ML) and Natural Language Processing (NLP) tools to support the identification of datasets within a set of full-text publications.
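To make the identification task concrete, here is a minimal Python sketch that flags candidate dataset mentions in full text. It uses a hand-written pattern purely for illustration; the actual Kaggle models rely on trained ML and NLP components rather than rules like this.

```python
import re

# A deliberately simple stand-in for the competition models: flag capitalized
# phrases ending in words that often name datasets. The real winning models
# use trained ML/NLP components, not a hand-written pattern.
DATASET_CUE = re.compile(
    r"\b(?:[A-Z][\w-]+\s+){1,6}(?:Survey|Study|Census|Database|Dataset|Data System)\b"
)

def candidate_mentions(full_text: str) -> list[str]:
    """Return candidate dataset-name phrases found in a publication's full text."""
    return [m.group(0) for m in DATASET_CUE.finditer(full_text)]

text = ("Estimates were drawn from the National Health Interview Survey "
        "and linked to administrative records.")
print(candidate_mentions(text))  # ['National Health Interview Survey']
```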
Data Validation
Each agency will use the validation tool to confirm that the ML output is correct.
The validation tool presents reviewers with snippets from actual publications on which the model has been run and which contain references to the dataset being searched for. Each snippet includes a candidate phrase identified by the model, and the validators' goal is to determine whether the snippet refers to the correct dataset.
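As a minimal sketch of what a single validation item could look like, the following Python snippet models one snippet-review record. The field names and the record_verdict helper are hypothetical; the validation tool's real data model is not specified here.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical record shape for one validation task; field names are
# illustrative, not the tool's actual schema.
@dataclass
class ValidationSnippet:
    publication_id: str
    snippet: str                     # text excerpt surrounding the candidate phrase
    candidate_phrase: str            # dataset mention proposed by the model
    target_dataset: str              # dataset the reviewer is searching for
    is_match: Optional[bool] = None  # reviewer's verdict; None until reviewed

def record_verdict(item: ValidationSnippet, verdict: bool) -> None:
    """Record the reviewer's yes/no decision on whether the snippet
    refers to the target dataset."""
    item.is_match = verdict
```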
The results will: