What’s there? What’s missing? Quick guide to understanding data completeness

When we talk about coverage or completeness, we want to know two things. First, what’s there? And second, what’s missing? We want to survey the land and get a short but complete overview. How do we do this? We look at our data from more than one angle.

A map is not the territory…

Data is a tool

It represents something we’re interested in. That thing could be cars, loans, flowers, or cups. Whatever it is, we want to record or review information about it. Knowing about it can help us sell the right cars, guide our clients to the right loans, report on the state of the flower industry, or manufacture more instagrammable cups.

A white cup in focus on a table with a blue tablecloth in the background
Reality

Data describes concepts

It represents ideas we’re sharing. There are many styles and shapes of cups in the world, but the icon of a cup is pretty much universally understood. I may not know the style or shape of your cup, but I understand “cup-ness”.

Cup Icon by Design Revision
Concept

How does this help us understand completeness?

Let’s take a step back. We’re unlikely to be interested in every cup that ever existed, so we have a scope. Let’s say we’re interested in cups we make and sell. Our universe of cups is limited to just those cups.

We want to know a few things about our cups: the materials we used, how large or small they are, that sort of thing. So we decide on headings or columns for each of the attributes (information about cups) that we’re interested in.

cup schema
Schema

This list of things about cups is the schema. It’s a template that describes what we want to know about cups. It isn’t our data on cups (we’ll add that under the headings) but it gives us some direction about what to record.

Unlike the concept of a cup, the schema of a cup isn’t intuitive. We’d struggle to instantly recognise “cup-ness” by looking over this list. We’ve taken reality, abstracted it to a concept, then made that into a schema, which is the container for our data.

So back to completeness. When we talk about completeness, we could be talking about the concept or the schema. These are different questions, but together they give us insight into the state of our cup data.

  • Concept – How many cups are we reporting?
  • Schema – How many cup attributes are we reporting?

Concepts & Schemas: How are they different?

In general, when we talk about the concept of a cup, we have a list of information we need to understand “cup-ness”. So we may agree it’s not a “cup” unless we have these things: cup#, name and type. That’s close enough to our concept of a cup that we can ask questions about the number of cups. This is the sort of information we use to plan campaigns, make strategic decisions and launch new cups to the market.

In reality, we don’t record everything diligently. We miss things out for a host of reasons. This is even more obvious when we aren’t recording the data ourselves.

Data has gaps

Understanding where those gaps are is important. Gaps affect how we report on concepts. If we’re missing cup names, that reduces the number of cups we report. We use information about gaps to improve our data collection so that we can make better strategic and planning decisions.
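Here’s a minimal sketch of the two counts in plain Python. The cup records, the required fields (cup#, name and type) and the attribute list are all illustrative, following the example above:

cups = [
    {"cup#": 1, "name": "Classic", "type": "mug", "material": "ceramic"},
    {"cup#": 2, "name": "Tiny", "type": "espresso", "material": None},
    {"cup#": 3, "name": None, "type": "mug", "material": "glass"},
]

REQUIRED = ["cup#", "name", "type"]  # our working definition of "cup-ness"
ATTRIBUTES = ["cup#", "name", "type", "material"]  # the schema

# Concept: count records that satisfy the concept of a cup.
conceptual_cups = sum(all(cup.get(f) is not None for f in REQUIRED) for cup in cups)
print(f"Conceptual cups: {conceptual_cups} of {len(cups)}")  # 2 of 3: one cup has no name

# Schema: fill rate per attribute.
for attr in ATTRIBUTES:
    filled = sum(cup.get(attr) is not None for cup in cups)
    print(f"{attr}: {filled}/{len(cups)} filled")

The same record can show up in both measures: the unnamed cup drops out of the concept count and drags down the name column’s fill rate.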

Takeaway

The upshot? To understand how complete our data is, we survey our data landscape in two ways: by concepts and by schema. We can count conceptual cups or count cup attributes to find what’s there and what’s missing. The two strands help us understand what’s going on in our data.

  • Some things (or attributes) are more important than others: they map to concepts;
  • Some things are conceptual (“cup-ness”) others are schematic (the cup attributes);
  • Some things are more useful for planning and strategy (concept) and others for improving data quality (schema).

Legacy Code Rocks: Open Data with Edafe Onerhime

I just loved chatting with Andrea on the Legacy Code Rocks! podcast. Listen: Open Data with Edafe Onerhime

Edafe Onerhime is a consultant on Data Science and Data Analysis who has over 20 years of experience answering difficult questions about open data. She has helped governments, charities and businesses make better decisions and build stronger relationships by understanding, using and sharing their data. In this episode, we discuss the history of open data, its importance in building communities and its similarities to open source and open science.

Have a good open data policy

Can I Trust Your Open Data?

You want people to use your data. They want confidence that they can trust your data and rely on it, now and in the future. A good open data policy can help with that.

An open data policy sets out your commitment to your open data ecosystem. It should detail how you will collect, process, publish and share data. It sets expectations for anyone using your open data and, if you stick to it, builds confidence about what to expect.

You can create your own open data policy from the Open Data Services open data policy template, check out the Sunlight Foundation guidelines or Socrata’s How to develop your open data policy article. There are also plenty of open data policies in the wild to learn from.

Remember: it’s not enough to have a policy, you have to stick to it to build trust and confidence in you as an open data publisher and in your open data.

Make It Play Well With Other Data

How do I make my open data as useful as possible? How do I connect it with other data to boost insight? How do I answer really tough questions with open data? Make it play well with other data – make it interoperable.

interoperable (ˌɪntərˈɒprəbəl) adj

(Computer Science) of or relating to the ability to share data between different computer systems, esp on different machines: interoperable network management systems.

Why should you care about this?

If you want your open data to help answer questions, solve problems, boost the economy by fuelling innovation or be used in research, you need to go beyond names and places.
Do these mean the same company?
  • ACME
  • ACME Limited
  • A.C.M.E
How about now?
  • GB-COH-123456: ACME
  • GB-COH-123456: ACME Limited
  • GB-COH-123456: A.C.M.E
Bit more confident? You can take that code 123456* and find the company on Companies House (hint: that’s what the GB-COH- tells the machine using your open data!). Go you, you’ve just opened up a whole new world of information! This example uses a shared, standard way of talking about organisations; find out more at org-id.guide.
(* P.S This is just an example, ACME doesn’t really exist!)
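If you’re curious what a machine actually does with that prefix, here’s a minimal sketch in Python. The helper name split_org_id is mine, and the assumption that the prefix is always two dash-separated parts (country list code plus register code, as on org-id.guide) is illustrative:

def split_org_id(org_id):
    """Split an org-id style identifier into (prefix, local_id)."""
    parts = org_id.split("-", 2)  # at most 3 pieces, e.g. ["GB", "COH", "123456"]
    if len(parts) != 3:
        raise ValueError(f"Unexpected identifier format: {org_id!r}")
    prefix = "-".join(parts[:2])  # "GB-COH" -> the UK Companies House register
    return prefix, parts[2]       # ("GB-COH", "123456")

for raw in ["GB-COH-123456: ACME", "GB-COH-123456: ACME Limited", "GB-COH-123456: A.C.M.E"]:
    identifier, name = raw.split(": ", 1)
    print(split_org_id(identifier), name)
# All three rows share ("GB-COH", "123456"), so despite the different
# spellings, a machine can treat them as one company.

That’s all interoperability really asks of your data: enough shared structure that two datasets can agree they’re talking about the same thing.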

Now what?

You can start to answer questions like this:
Answer tough questions with good quality open data
These codes, or identifiers, are a gold mine. Every country has agencies that give codes to businesses, charities, non-profits and more. Use those codes where you can.

Can I share codes for anything else?

Of course! You can identify places, things, categories, types and much, much more.

Tip: Make your open data more useful by making it easy to connect with other data.

See all the tips in one place: Good Quality Open Data

More on: interoperable
Courtesy of Collins English Dictionary – Complete and Unabridged, 12th Edition 2014 © HarperCollins Publishers 1991, 1994, 1998, 2000, 2003, 2006, 2007, 2009, 2011, 2014

Information commons for the UK charitable sector

Exploring how 360Giving underpins the data infrastructure for charitable grant making and how it supports an information commons for the sector. It all starts with a little clarity.

For two years, I helped funders open up the flow of information in the non-profit sector by publishing what they fund. This is powerful insight. Understanding the 360Giving data standard is crucial if more funders are to adopt it.

Funders need to know:

  1. What is the data standard?
  2. What must be provided, what’s recommended and what’s optional? (and why?)
  3. How does the standard fit together?
  4. How does our data map to the standard? (see the sketch after this list)
  5. How can we ensure we’re telling the true story of our funding?
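To make question 4 concrete, here’s an illustrative sketch of mapping internal grant columns to 360Giving-style column names. The internal names, and this exact mapping, are made up; check the 360Giving documentation for the authoritative field list:

COLUMN_MAP = {
    "grant_ref": "Identifier",
    "grant_title": "Title",
    "amount_gbp": "Amount Awarded",
    "awarded_on": "Award Date",
    "recipient_name": "Recipient Org:Name",
    "funder_name": "Funding Org:Name",
}

def map_row(internal_row):
    """Rename internal columns to standard-style ones, keeping only mapped fields."""
    return {std: internal_row[ours] for ours, std in COLUMN_MAP.items() if ours in internal_row}

print(map_row({
    "grant_ref": "360G-example-001",
    "grant_title": "Community garden",
    "amount_gbp": 5000,
    "awarded_on": "2018-05-01",
    "recipient_name": "Example Trust",
    "funder_name": "Example Foundation",
}))

A mapping like this makes the gaps obvious too: any internal column missing from the map is data you hold but don’t share, and any standard field missing from your output is a conversation to have about collection.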

Supporting the standard meant creating tools, reports, and visualisations to provide clarity and provoke discussion (the standard isn’t static, so your voices as funders, data users, tech and non-profits are hugely important).

One question I hadn’t answered to my satisfaction was “How does the standard fit together?”. So I created a data visualisation to explain what 360Giving helps you share and how it’s put together to support good quality open data on funding.

360Giving Data Schema Visualised

With funding information shared in a similar way, charitable grant making organisations can ask & answer questions like:

  • How can we share the story of our funding?
  • Can we find partners by sharing our grant making?
  • How can we tackle our shared missions together?

Sharing data openly connects organisations. That’s why open data is the basis of a shared charitable sector information commons. Historically, the non-profit sector had it tough – no-one wanted to fund infrastructure. Here’s what Friends Provident Foundation‘s Danielle Walker-Palmour had to say at a social investment event:

No one wants to fund infrastructure – we need to think of infrastructure as a commons to achieve our sector’s collective goals.

Times and perceptions are shifting; Barrow Cadbury‘s Connect Fund is making headway investing in infrastructure for social investment. Similar initiatives are expected to follow.

A shared commons of information needs standards that make information simple to combine, easy to understand and usable by organisations of every size. The 360Giving data standard is an integral part of the commons and the sector’s data infrastructure. The goal? A shared information commons that sees more of the non-profit sector working together, seamlessly.

Continuous Feedback: A generic data science pipeline

A few years ago, I worked with an organisation that sells automotive intelligence, helping them streamline the way they got insight from data. I came up with a generic data pipeline to explain to the board how their new data science process could work. It was a hit!

Visuals are a great way to explore a concept and explain a process that could otherwise lose folks along the way.

The key to a good data pipeline is that it’s part of an overall process (not shown here) in which you know what the problem is, why it’s important to solve it, and that data is definitely going to help.

The pipeline focuses on continuous feedback – feedback at every key stage of the process. This could be to the problem owner, other teams, or any other stakeholder to keep them informed and fold their feedback back into the pipeline.

So, here’s my blast from the past – feel free to swap the Domain Data Science step out for other processes that make sense, or drop it altogether; whatever works for your situation.

Generic data pipeline v1
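And since the pipeline is really about the feedback loop, here’s a toy sketch of that idea in code. The stage names are placeholders, not a transcription of the diagram; the point is the hook that reports after every stage:

def collect(state):
    state["raw"] = [1, 2, None, 3]  # stand-in for real data collection
    return state

def clean(state):
    state["clean"] = [x for x in state["raw"] if x is not None]
    return state

def analyse(state):
    state["mean"] = sum(state["clean"]) / len(state["clean"])
    return state

def run_pipeline(stages, feedback):
    state = {}
    for stage in stages:
        state = stage(state)
        feedback(stage.__name__, state)  # continuous feedback after every stage
    return state

run_pipeline([collect, clean, analyse],
             feedback=lambda name, s: print(f"after {name}: {s}"))

Swapping a stage in or out is just editing the list, which is the same flexibility the diagram suggests for the Domain Data Science step.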