Our Tools

OUR APPROACH

We provide a brief introduction to the methods we use to ingest and process data, and to the technology involved at each step that makes this project possible.

There are three main workflow stages: identifying datasets in publications (process); providing agencies and researchers with access to information through APIs, Jupyter Notebooks, and researcher dashboards (access and disseminate); and gathering feedback from the user community (feedback).
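As a rough sketch of the access-and-disseminate stage, the example below queries a hypothetical REST endpoint for publications that mention a given dataset. The host, route, query parameter, and response fields are all assumptions for illustration; the project's actual API may look quite different.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint; the real API host, route, and fields are assumptions.
BASE_URL = "https://api.example.org/v1/mentions"

def dataset_mentions(dataset_name):
    """Fetch publications that mention the given dataset (illustrative only)."""
    url = BASE_URL + "?" + urlencode({"dataset": dataset_name})
    with urlopen(url) as response:
        return json.load(response)

for publication in dataset_mentions("Survey of Earned Doctorates"):
    print(publication["title"])
```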

How We Do It

The approaches we develop

This project is built on machine learning algorithms drawn from three winning Kaggle competition models. A team of expert reviewers validates the outputs to ensure that the results are accurate and reliable.

Machine Learning Algorithms
Developed from three winning Kaggle models built with machine learning and natural language processing tools.

Data Validation
Provides reviewers with snippets from actual publications on which the model has been run and which contain references to the dataset being searched for.

Machine Learning Algorithms 

Three winning Kaggle models were developed using Machine Learning (ML) and Natural Language Processing (NLP) tools to support the identification of datasets within a set of full-text publications.

  • Model 1: Deep Learning - Sentence Context - A deep learning-based approach that learns which kinds of sentences contain references to a dataset. It is the most robust to new datasets, and it evaluates all of the text within the document.
  • Model 2: Deep Learning - Entity Names - A deep learning-based approach that extracts entity names from the text and classifies each entity as a dataset or not.
  • Model 3: Pattern Matching - A rule-based approach that searches the document for patterns similar to a list of existing datasets (a minimal sketch follows this list).
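As a rough illustration of the rule-based idea behind Model 3, the sketch below scans text for mentions that resemble a list of known dataset names. The name list, the whitespace-tolerant matching rule, and the snippet window are simplifying assumptions for illustration, not the competition model itself.

```python
import re

# Illustrative known dataset names; the real model works from a much larger list.
KNOWN_DATASETS = [
    "National Education Longitudinal Study",
    "Baltimore Longitudinal Study of Aging",
]

def find_dataset_mentions(text):
    """Return (dataset name, surrounding snippet) pairs found in the text."""
    mentions = []
    for name in KNOWN_DATASETS:
        # Case-insensitive match, tolerant of line breaks between words.
        pattern = r"\b" + r"\s+".join(map(re.escape, name.split())) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            start = max(0, match.start() - 40)
            end = min(len(text), match.end() + 40)
            mentions.append((name, text[start:end]))
    return mentions

sample = ("Our analysis draws on the National Education Longitudinal Study "
          "to track student outcomes over time.")
print(find_dataset_mentions(sample))
```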

Data Validation 

Each agency will use the validation tool to ensure that the ML output is correct.

The validation tool provides reviewers with snippets from actual publications on which the model has been run and which contain references to the dataset being searched for. Each snippet contains a candidate phrase identified by the model, and the validators' goal is to determine whether the snippet refers to the correct dataset.
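To make the snippet idea concrete, here is a minimal sketch of how a review snippet might be cut from a publication around a model-identified candidate phrase. The window size and the highlighting convention are assumptions for illustration, not the actual tool's behavior.

```python
def make_review_snippet(full_text, candidate, window=120):
    """Cut a snippet of context around a candidate phrase for human review.

    Returns None if the candidate phrase is not found in the text.
    The window size and highlighting are illustrative choices only.
    """
    position = full_text.find(candidate)
    if position == -1:
        return None
    start = max(0, position - window)
    end = min(len(full_text), position + len(candidate) + window)
    # Mark the candidate so reviewers can spot it quickly.
    highlighted = full_text[start:end].replace(candidate, f">>{candidate}<<")
    return "..." + highlighted + "..."

text = ("Estimates were produced using the Survey of Earned Doctorates, "
        "which covers all research doctorate recipients.")
print(make_review_snippet(text, "Survey of Earned Doctorates", window=30))
```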

The results will:

  1. provide insights into how well the models are identifying datasets (a simple aggregation is sketched below),
  2. find other datasets used in conjunction with the agency datasets to study similar topics, and
  3. identify how researchers make use of each agency's data.
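As an illustration of the first point, validator decisions can be aggregated into simple quality metrics. The sketch below computes per-dataset precision from hypothetical (dataset, verdict) validation records; the data layout is an assumption for illustration only.

```python
from collections import defaultdict

# Hypothetical validation records: (dataset searched for, validator's verdict).
validations = [
    ("Survey of Earned Doctorates", True),
    ("Survey of Earned Doctorates", True),
    ("Survey of Earned Doctorates", False),
    ("Baccalaureate and Beyond", True),
]

def precision_by_dataset(records):
    """Share of model-identified snippets that validators confirmed, per dataset."""
    confirmed = defaultdict(int)
    total = defaultdict(int)
    for dataset, is_correct in records:
        total[dataset] += 1
        confirmed[dataset] += int(is_correct)
    return {dataset: confirmed[dataset] / total[dataset] for dataset in total}

for dataset, precision in precision_by_dataset(validations).items():
    print(f"{dataset}: {precision:.0%}")
```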