What questions can we ask? Will this data help solve our problem? Can we use this algorithm or that one?
Welcome to data wrangling 101. Exploring our data before we dive in and start playing with it or reshaping it means more productive data science or data analysis. If you’re lucky, you know enough about the domain to understand the quirks a dataset throws your way, or you have someone to badger. On your own with an unfamiliar dataset? That happens too. So here are 3 lessons from wrangling the Arts Council England 2018-2022 national portfolio dataset.
First, a little bit about Arts Council England:
Arts Council England is a public body supporting arts and culture in England. It is funded by public funds from the UK government and the National Lottery. Between 2015 and 2018, it will invest £1.8 billion in arts, museums and libraries. The funds will support art and culture experiences including theatre, digital art, reading, dance, music, literature, crafts and collections.
Why on earth are we interested in the national portfolio dataset?
The National Portfolio programme supports organisations considered by Arts Council England to represent the best of global arts practice. Funding is given over multiple years, currently 3. Between 2015 and 2018, £1 billion will be invested in 663 organisations.
That’s a lot of money and a lot of prestige! I’m still exploring the dataset but here’s what I’ve learned so far.
Lesson 1: Test your assumptions
My first assumption was a bust. One thing it’s useful to know is “Which fields make each row of the data unique?”. This helps us report on stuff like “How many grants were issued by the Arts Council?” and “To how many organisations?”. It was easy to jump in at first glance and say the organisation’s name, the Applicant Name. Unfortunately, an organisation can be awarded under multiple funds.
Ah OK, so maybe Applicant Name and the type of fund, the Funding Band? At first that worked great but then 1 rogue entry popped up… It turns out that most of the time an organisation gets 1 grant, sometimes 2, but Tyne & Wear Archives & Museums got 3!
The upshot? Test your assumptions. This might be an anomaly or it might be legitimate. We can’t always tell, so we’re going to have to ask.
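Here’s a minimal sketch of that kind of check in plain Python. The rows are made-up placeholders, not real awards, and duplicate_keys is a hypothetical helper name. Only the column names Applicant Name and Funding Band come from the dataset.

```python
from collections import Counter


def duplicate_keys(rows, key_fields):
    """Return each combination of key_fields that appears more than once, with its count."""
    counts = Counter(tuple(row[field] for field in key_fields) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Placeholder rows for illustration only (not real Arts Council England data).
rows = [
    {"Applicant Name": "Org A", "Funding Band": "Band 1"},
    {"Applicant Name": "Org B", "Funding Band": "Band 1"},
    {"Applicant Name": "Org B", "Funding Band": "Band 2"},
    {"Applicant Name": "Org C", "Funding Band": "Band 1"},
    {"Applicant Name": "Org C", "Funding Band": "Band 1"},
    {"Applicant Name": "Org C", "Funding Band": "Band 2"},
]

# Assumption 1: Applicant Name alone identifies a row.
by_name = duplicate_keys(rows, ["Applicant Name"])

# Assumption 2: Applicant Name plus Funding Band identifies a row.
by_name_and_band = duplicate_keys(rows, ["Applicant Name", "Funding Band"])

print("Repeated names:", by_name)
print("Repeated name + band pairs:", by_name_and_band)
```

An empty result means the assumption holds for this data. Anything else is a list of rows to go and ask about.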
Lesson 2: Don’t be afraid to ask
📢 Data isn’t a perfect reflection of the real world.
When we collect, share or use data, we curate it. We make decisions about what and how much detail to include. We can’t assume that data is perfect, so sometimes we have to ask the hard questions like “Why was Tyne & Wear Archives & Museums awarded 3 grants?”
Other oddities cropped up in the data that needed that human touch. Arts Council England share a lot of geographic information. Check out what you can find:
- Local Authority
- ACE Region
- ONS Region
They’re all slightly different. Some are clearly internal, like ACE Region, and others are official geographies, like ONS Region. But what about the Area column? I was stumped, so I asked the very friendly Arts Council England support team.
Here’s what I heard back:
I have heard back from our Digital Team and they advised that the ‘area’ column on the sheet attached by the person making the enquiry refers to Arts Council areas, these are:
- London – comprising NUTS 1 region of London
- Midlands – comprising NUTS 1 regions of East Midlands and West Midlands
- North – comprising NUTS 1 regions of North East, North West and Yorkshire and the Humber
- South East – comprising NUTS 1 regions of East of England and South East (excluding the county of Hampshire, and Unitary authorities of Isle of Wight, Portsmouth and Southampton)
- South West – comprising NUTS 1 regions of South West plus the county of Hampshire, and Unitary authorities of Isle of Wight, Portsmouth and Southampton
More information on the areas can be found here: http://www.artscouncil.org.uk/about-us/your-area
The organisations labelled National are certain Sector Support Organisations with a national remit.
The NUTS 1 region which each organisation is located in can be found in the column headed ‘ONS region.’
Hope this helps.
Ah, that’s really handy to know. If we need to, we can map Area to Nomenclature of Units for Territorial Statistics (NUTS) regions or decide if we know enough about geography from other columns and can ignore Area.
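If we did want to map Area to NUTS 1 regions, a rough sketch might look like this. The groupings come straight from the reply above, but the exact spelling of the values in the dataset is an assumption, and area_matches_ons_region is a hypothetical helper name.

```python
# Arts Council area -> NUTS 1 regions it can contain, per the support team's reply.
# Value spellings are assumed; check them against the actual columns.
AREA_TO_NUTS1 = {
    "London": {"London"},
    "Midlands": {"East Midlands", "West Midlands"},
    "North": {"North East", "North West", "Yorkshire and the Humber"},
    "South East": {"East of England", "South East"},
    # Hampshire, Isle of Wight, Portsmouth and Southampton sit in the
    # South East NUTS 1 region but belong to the South West area.
    "South West": {"South West", "South East"},
}


def area_matches_ons_region(area, ons_region):
    """Return True/False for regional areas, or None for 'National' (no regional mapping)."""
    if area == "National":
        return None
    # An unrecognised area label is treated as a mismatch.
    return ons_region in AREA_TO_NUTS1.get(area, set())


print(area_matches_ons_region("North", "North West"))
print(area_matches_ons_region("London", "South East"))
print(area_matches_ons_region("National", "London"))
```

Note that this can only check at region level. Telling a Hampshire organisation apart from any other South East one would need the Local Authority column as well.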
The upshot? Don’t be afraid to ask. Making assumptions can come back to bite you. If you can, ask someone who knows so you understand their design choices. You don’t have to do this for every single column; focus on the ones that are most likely to help solve your problem. You can also come back as you iterate. Remember, it’s a cycle.
Lesson 3: Remember it’s a cycle
There are a few methodologies, good practices and guidelines that help you punch through the worst bits of data wrangling so you can get to the good bits. You might be data mining or predicting or deep learning. No matter your intended application, you’ll most likely be iterating – going around in a cycle of try, test, understand till you have a good enough answer.
When you first start working with data it can seem overwhelming. Remembering it’s a cycle will keep you sane. You might miss things the first time, and that’s OK. That’s why we test and iterate.
I started exploring the Arts Council England 2018-2022 national portfolio dataset to answer a friend’s question and then to streamline my practice. Along the way I made assumptions, backtracked, tried data visualisations that didn’t work and rolled my eyes – a lot. Each iteration, I learned something new and useful about the story of national portfolio funding for the next 3 years. I hope you have too.