Behind the scenes of a Zindi contest

Ever wondered what goes into launching a data science competition?

Step 1: Inspiration

Step 2: Show me the data

  • Be shareable. Anything confidential needs to be masked or removed, and you either need to own the data or have permission to use it. For the birdsong challenge, we used data that had CC licences but we still made sure to get permission from xeno-canto and check that we’re following all the licence terms (such as attribution and non-modification).
  • Be readable. This means no proprietary formats, variable definitions, sensible column names, and ideally a guide for reading in the data.
  • Be manageable. Some datasets are HUGE! It’s possible to organize contests around big datasets, but it’s worth thinking about how you expect participants to interact with the data. Remember — not everyone has fast internet or free storage.
  • Be useful. This isn’t always easy to judge, which is why doing data exploration and building a baseline model early on is important. But ideally, the data has some predictive power for the thing you’re trying to model!
Visualizing Birdsongs
  • Collection: For an organization, this would be an ongoing process. In our example case, this meant scraping the website for files that met our criteria (Southern African birds) and then downloading tens of thousands of mp3 files.
  • Cleaning: A catch-all term for getting the data into a more usable form. This could be removing unnecessary data, getting rid of corrupted files, combining data from different sources…
  • Splitting and Masking: We picked the top 40 species with the most example calls, and then split the files for each species into train and test sets, with 33% of the data kept for the test set. Since the file names often showed the bird name, we used ‘’.join(random.choices(string.ascii_uppercase + string.digits, k=6)) to generate random IDs. However you approach things, you’ll need to make sure that the answers aren’t deducible from the way you organize things (no sorting by bird species for the test set!)
  • Checking (and re-checking, and re-checking): Making sure everything is in order before launch is vital — nothing is worse than trying to fix a problem with the data after people have started working on your competition! In the checking process I discovered that some mp3s had failed to download properly, and others were actually .wav files with .mp3 as the name. Luckily, I noticed this in time and could code up a fix before we went live.

Step 3: Getting ready for launch

A confusion matrix — one way to quickly see how well a classification algorithm is working. (from the starter notebook)
Fine-tuning the benchmark model

Try it yourself!

About the author

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Zindi

Zindi

347 Followers

Zindi hosts the largest community of African data scientists, working to solve the world’s most pressing challenges using machine learning and AI.