This is a quick post about where you can find sources of data to run experiments and code on. It was inspired by a talk during the week about some open data initiatives. It reminded me of my own search for data when I started out. In the end I abandoned working on any of the sources in favour of taking on a course of study in Cyber Security. That’s a year I’ll never get back!
However, I had done a little groundwork, the course is over, and these sources may well be of use to someone, so I’ll make a quick post about them.
First, to refresh my results, I did a quick web search and came up with a post on the exact same topic from one of my favourite sources. As usual it’s filled with great suggestions, and it even contains a few of the sources I had identified. The source is freecodecamp.org, which has constantly impressed me ever since I first came across it a couple of years ago. Their post on open data sources is here.
To highlight one of the sources they mention, I would like to point to Kaggle. Kaggle is a data science competition site. It hosts regular competitions in which people with a bunch of data publish it and ask members of the site to carry out analysis on it. The publisher may have a specific question they want answered and leave it up to the competitors how to answer it. The best feature is that the answers submitted by the competitors are then made available for other users to look through and learn from. It really is a gem of a resource.
A source that I became aware of this week is part of a drive by Microsoft to promote open data sharing generally. There are a few sources mentioned on that promotional page and you can take a look and see what piques your interest.
Another data science competition website I became aware of during the talk I mentioned above is drivendata.org. I have no personal experience of it, but it was recommended during the talk so I might as well share. From a quick dive into it, it seems similar to Kaggle, and you can even drill down into some of the answers submitted to certain competitions and learn from them.
The most promising source of open data I found on my search long ago was the London Datastore, a set of data provided openly for public consumption.
However, forget all that. I’m going to finish off with what looks like the ultimate source of sources. The good thing about it is that, as it is a GitHub project, you can clone it and ensure you always have access to it no matter what happens to GitHub itself or the project in the future. It is a collection called “Awesome Public Datasets” provided by user AwesomeData. Enjoy!
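If you would rather script the clone than type it by hand, here is a minimal Python sketch of the idea. It assumes git is installed and on your path. The repository URL is my assumption based on the project and user names, so check it against the project page first. The helper names `clone_command` and `clone_repo` are made up for illustration.

```python
import subprocess
from pathlib import Path

# Assumed repository URL (not confirmed here); verify it on the project page.
REPO_URL = "https://github.com/awesomedata/awesome-public-datasets.git"


def clone_command(repo_url, target):
    """Build the git command that makes a local copy of the repository."""
    return ["git", "clone", repo_url, str(target)]


def clone_repo(repo_url, target):
    """Clone the repository, unless the target directory already exists."""
    target = Path(target)
    if target.exists():
        print(f"{target} already exists, skipping clone")
        return
    # check=True raises CalledProcessError if git exits with an error.
    subprocess.run(clone_command(repo_url, target), check=True)


# Example call (needs network access and git):
# clone_repo(REPO_URL, "awesome-public-datasets")
```

Once cloned, the copy on your disk stays readable whatever happens upstream, and running `git pull` inside the folder now and then picks up newly added datasets.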
Now that should be enough data for the hungriest of data scientists. Sunday Quicky away!