Producing the Open Repair Alliance datasets

The Open Repair Alliance technical team does regular work on collecting and combining the data from Open Repair Alliance partners to produce the Open Repair Alliance datasets.

Here we provide an overview of this work, the purpose of which is to give an introduction of the work we do for interested partners. This is a technical overview – for more general information, please see our pages on our open data. If you’re interested in discussing this technical work in further detail, or if you’re interested in joining the alliance as a data partner, get in touch

The process

The work requires liaising with data partners to collect, clean, map and combine the repair data in order to provide it for download.  The open repair data is shared according to the Open Repair Data Standard, and much of our work is in mapping partner data to this standard.

The ORA dataset is updated regularly – normally every 6 months, although every 12 months during the pandemic.  

Liaising with data partners

A data partner is a community repair network/organisation that shares its repair data to the Open Repair Alliance for inclusion.  

We work with data partners in reviewing their repair data and figuring out how we can map it to the Open Repair Data Standard.  Each time we do produce an updated set of ORA data, we ask for the latest data from all partners.

Collecting the data

There are various ways that a data partner might be collecting their repair data.  They may have an existing software platform with which they collect data, or a spreadsheet, or they may have been collecting another way.  At present we can only work with partners who have been collecting data in an electronic format of some kind.

We accept data in several formats – Comma Separated Values (CSV), Excel spreadsheets and Google Sheets. We need to be able to import the data to a Google Sheet and export it to CSV.

With new partners we do an initial analysis of the repair data to see what work is required to include it, looking at factors such as the size of the dataset and how closely it currently matches the ORDS format.  Assuming all is well we can then work further with the data partner on a mapping of the data to the ORDS format.

For existing partners, we generally need to review and further map the data each time we receive it, as changes may have occurred in the way the partner collects the data or changes to the ORDS format.  If a partner is already providing the data in the ORDS format, then we do not have to do much work here.  Ideally all partners could provide the partner in ORDS format, but we recognise the complexity of this and help turning the data into ORDS format where we can.

Defining mapping from partner data to ORDS

Once we have received data from a partner, we then work on figuring out mappings from the partner data to the ORDS format.  The process of checking and defining these mappings takes place in Google Sheets.

Column mappings

We work on identifying which columns in partner data contain the data that corresponds to which ORDS fields.  Sometimes multiple columns in source data might correspond to one ORDS field.  Sometimes a subset of a source column might correspond to one ORDS field.

Value mappings

To date, the data provided by a data partner rarely maps exactly to ORDS.  Thus we work to determine any transformations that need applying to the source data to convert it to ORDS format.  These mappings vary from partner to partner, and some are more complicated than others.

Mappings that can be complicated 
Other mappings/calculations
  • ID
  • Partner product category (usually requires concatenating 2 tiers of category values)
  • Country (ensuring the strings map to ISO codes)
  • Event date (making sure in correct format)
  • Year of manufacture/product age (sanity checking)

As mentioned, this mapping generally has to be checked (and sometimes redone) each time a new ORA dataset is produced, as the format or content of an underlying partner dataset might have changed.

Even when a partner dataset is closely aligned to ORDS values, they still sometimes have different spellings, casing, etc – so we must always check the mappings.

Outputting ORA datasets

We currently work on a 6 to 12 month rhythm for the production of datasets.  We produce an aggregate dataset, and a dataset for each partner, available at our downloads page.

Acquiring data

We ask partners for export of the latest datasets. Data can arrive via email in CSV or Excel files, or as a link to a Google Sheet.  Currently we ask for a full dataset each time (i.e. from beginning of time to end of requested period).  We ask for the full dataset each time as partners might add, amend or delete historical data in between exports.  We maintain a unique ORDS id between exports.

Checking data

We review data each time to make sure that data is provided in a usable format, in the same format as last time, etc.

We might need to update mappings because new category values have appeared in partner data that we have not previously mapped to ORDS values, or new ORDS categories have been added.

Data cleaning

We check through and clean the data each time we receive a new set of data from a partner.

For example, we

  • Find / remove duplicate records
  • Sanitise data.  Data might not be validated during upstream data capture, and thus incorrect data might be present, and we endeavour to tidy up the data that we can.  Some issues we encounter in the data:
    • Year of manufacture: negative ages, obvious typos, suspect ages
    • Multiple items recorded in one row
    • Duplicate unique identifiers
    • Apparently duplicate data with different identifiers
    • Inappropriate data – test records, bad language, etc
    • Emojis and non-ascii characters

We also make a note of these issues and report them to our data partners.

Changes for latest version of ORDS

If the Open Repair Data Standard has changed, then we will make the corresponding amendments to the process and the software we have written to take into account any changes in ORDS.

Aggregation of datasets

We produce an interim aggregation, combining the partner datasets so that we can then carry out some final checks.

Final checks:

  • Number of records per partner, amount of change since last time
  • Number of records per category, amount of change since last time

Packaging

Finally we produce CSV and JSON files of the data, along with manifest files, and package all this together for upload to both openrepair.org and GitHub.

Communications

Once produced, we share the data first with internal partners for review.  When given the green light, we then publish a blog post on https://openrepair.org and then promote the release of the new dataset on social media.


FAQ

Could you combine the ORA data more frequently?

Given the amount of work involved in combining the data, we do not currently have the resources to combine the data more regularly.  If data partners were in a position to provide the data in ORDS format directly, more regular compilation might be possible.  That said, for the current policy use cases of the data, a more frequently updated dataset is not required.

Have you considered collecting the data via an API from partners?  This would allow for close to real-time data.

API-based collection would require data partners to provide their data via an API, and both real-time mapping to the ORDS format and data cleaning would be required.  This would be a big undertaking, given the range of software being used by partners. It would also have an impact on any changes to the ORDS format as it would require partners to update their APIs.

Have you considered exposing the data via an API? Then I could access it in real-time.

Until we collate the data in real-time (see above), we don’t see much benefit in exposing the data via an API. If you need direct access to the data for use in a script, it is available directly from Github: https://github.com/openrepair/data.

What kind of issues do you see in the data?

Even when the work of creating the mapping from a partner data structure to ORDS structure has been done, there can still be issues with the data itself – either its structure or the data quality.  Some issues we have encountered:

  • Duplicate record ids
  • No unique ids provided
  • Format of spreadsheet sent by partner changed from export to export, multiple follow up discussions required
  • Partner sent two spreadsheets with column headings in two different languages, with slightly different datasets contained within
  • Partner provided multiple spreadsheets that we need to concatenate together
  • ORDS fields need extracting from part of source field (e.g. group id from URL, event date from URL)
  • Spreadsheets with formats that do not immediately convert to CSV, e.g. with merged cells and columns such as data subsets grouped by country
  • Inconsistent upstream categorisation of data, e.g Household non-electrical ~ Toaster
  • Seemingly duplicate records with different ids (these may be valid or not)
  • Invalid values, e.g. -10, 20010, 1025 in year of manufacture
  • New types of items appear in partner data, so we have to update mappings
  • Test data
  • Nonsense data or data containing bad language
  • Data that contains sensitive information such as names and email addresses
  • Emojis and other special characters
  • Groups with incorrect country code