This post was inspired by a problem I was too busy (or too lazy) to solve when I first had it, and which I eventually sidestepped with an easier solution. Nevertheless, I’ve come back to it, because there are times when the easier solutions available won’t go far enough in giving you access to the types of data you might want to work with.
So the problem was this: I wanted access to a store of data to practice working with Jupyter Notebook. At the time I had none to work with, so I thought about a way to generate some random(ish) data in order to get my practice in. The solution is somewhat meta: I use Jupyter Notebook to generate data so that I can then work on that data in Jupyter Notebook.
What’s the point?
A very pertinent question. I had applied for a job at a tech company, and I was invited for a day of interviews. I was told that one of the interviews would be a technical one surrounding the platform used within the company. *Gulp*. I had no idea how to prepare for that.
I came up with an idea, where if I could picture what data would be stored on the platform, I could then generate dummy records to work with so that I could perform a number of experiments on the dummy data. This would help me to conceive the types of questions that might be asked, which I could in turn propose answers to.
That day consisted of more than just the technical interview, and I never quite got around to executing the dummy-data plan, because I had to prepare for all of the interviews and not just one. But if I were to face a similar interview in the future it could still be mighty helpful, so I’ve arrived back here and come up with a solution using Jupyter Notebook.
An Introduction to Jupyter Notebook
My first introduction to Jupyter Notebook was in a college course. It was introduced along with Kaggle, which is a (now Google-owned) service on which data analysis competitions are hosted (I’m not too sure whether the service is more than that or not). Some entity will add a competition to the site that challenges competitors to try to find answers to questions posed surrounding a data set, which they also post.
For that course the lecturer simply wanted us to recreate a Jupyter notebook belonging to one of the entrants to a competition about the Titanic. It was very much a monkey see, monkey do exercise. But in the seeing, the raw potential of Jupyter Notebook was immediately obvious. Having the code so tightly bound to its results and presentation means notebooks are not just powerful data visualisation and problem-solving tools; they are also extremely instructional for the exact same reason. I was blown away by what I saw. Then life got in the way and it was quite a while before I was able to get back to them. But here I am now. New job, new focus, let’s solve the world with a Jupyter notebook!
This is not a Jupyter Notebook Tutorial
To get this out of the way: this is not a Jupyter Notebook tutorial. To be honest, it’s too visual a medium to do justice to in a blog-based tutorial. Don’t get me wrong, I’m sure there are excellent resources out there, but for me blog-based resources for Jupyter Notebooks are better confined to working through a very specific problem. People making use of such resources will, in general, be past the learner stage of Jupyter Notebook and will be laser-focused on a particular solution. Jupyter Notebook is best learned from a video course, so that you can watch the solutions being worked on as the tutor goes, and then recreate the work on your own system. As a recommendation, I have found this Packt course to be very instructive, although I’ve not completed it fully yet.
Who is this post aimed at?
Well, as I slip ever more into solipsism with this blog, it’s primarily as a resource for me. However, this post is generally intended for any Jupyter Notebook practitioner who wants to work with data that is simulated towards a particular type.
What is this post about?
Finally, a useful section. This post is about a Jupyter notebook that generates simulated data surrounding bank accounts. It could be modified and expanded in a myriad of ways to simulate any other type of data. Consider it a foundational notebook onto which you can add any type of data, and from which you can remove any data already present, to make it a simulation fitting your needs.
The notebook I’ve come up with generates a very simplified version of a bank’s account records. It generates 1000 bank accounts, which it then transforms into a pandas DataFrame in order to be worked with. For now the accounts all have the same name, “John Smith”, but it would be relatively trivial to plug in a random name generator for that field.
This is the solution I came up with, it’s not necessarily the best solution. But I’m basically trying to work my way towards solving as many of my data problems with Jupyter Notebook as possible, so I cut myself some slack.
At the end of the post I have attached a zip file containing the files used during this work which can be downloaded and worked with in any way you want. I’m going to add in screen shots here of the notebook and discuss them as I go in order to explain what is in the files and how to use the notebook.
The Notebook & Data
We’re now down to the meat of the matter. First let me show you a sample of the cities.txt file I used.
Here is the start of the cities.txt file. As you can see it’s a relatively standard city-per-line text file. You’ll see soon how it’s processed in the Python code in the relevant notebook. This is meant as a demonstration of how you can use text files to seed your simulated data. You’ll see later that each account has a name associated with it, I went with the lazy option for the name field and all accounts have the same name. But it could easily have been modified to read in a list of names to be randomly assigned to each account, similar to the solution you’ll see for cities here.
Next comes the start of the notebook itself:
The heading and explanatory text are in a markdown cell. The next cell is a code cell with the imports required for the rest of the notebook. For the purposes of this notebook I used the md5 hash of the current time for two reasons: 1. to guarantee uniqueness (collisions permitting), and 2. to introduce hashing into the notebook so that it can be used and expanded upon in other, future notebooks that have a more appropriate need for hashing. To be clear, I know I could have used an incrementing integer as an account ID, but that’s too trivial and less useful than getting to grips with hashing.
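To make that concrete, here is a minimal sketch of generating an account ID from the md5 hash of the current time. The helper name `make_account_id` is my own for illustration; the notebook simply inlines the same idea.

```python
import hashlib
import time

def make_account_id() -> str:
    """Return the md5 hex digest of the current time.

    time.time() is converted to a string and encoded to bytes,
    because hashlib works on bytes. IDs are unique in practice,
    collisions permitting, as noted in the text.
    """
    return hashlib.md5(str(time.time()).encode("utf-8")).hexdigest()

account_id = make_account_id()
print(account_id)  # a 32-character hex string
```

An md5 digest is always 128 bits, so the hex string is always 32 characters long regardless of input.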
The time and random imports should be relatively self-explanatory. Essentially, the time module will be used to work with dates that bank accounts are opened. The random module will be used to add variety to the data.
I added the math module thinking I would need it. It turns out I didn’t for this particular piece of work, but I’m leaving it there because the math module is very regularly required when working with data.
The pandas module is “a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.” It is the powerhouse that makes Jupyter Notebook so useful. Once the simulated data is generated it will be transformed into a dataframe, one of the primary data structures pandas uses to do its work.
Next we have the column names. If you’re seeking to recreate or extend this work, this is your starting point. You need to think about the data you’re trying to simulate and you then need to design it as you would a database table. Each element in the list is a column in your table, this can be as big or small as your use case demands. As you can see, I created a very basic account table design with these headings.
We move on to reading in the list of cities from the text file. Each line is passed through `strip()`, which removes any leading or trailing whitespace, including the newline character at the end of the line. As a sanity check at the very end I print out the list of cities, and you can see that they were successfully read in.
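A minimal sketch of that read-and-strip step, using a few made-up city names in place of the real cities.txt (the file contents here are an assumption for illustration):

```python
# Hypothetical cities.txt contents: one city per line, with some
# stray spaces and trailing newlines to demonstrate strip().
with open("cities.txt", "w", encoding="utf-8") as f:
    f.write("Dublin \nCork\n Galway\n")

# A bare strip() removes leading/trailing whitespace, newline included.
# The "if line.strip()" guard also skips any blank lines.
with open("cities.txt", encoding="utf-8") as f:
    cities = [line.strip() for line in f if line.strip()]

print(cities)  # ['Dublin', 'Cork', 'Galway']
```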
Next I wrote a function to be used with the dates later. It takes in an integer representing a number of years and returns a shift value in seconds; subtracting that shift from a date moves it back by up to that many years. This was a quick and dirty function: it is longer than it needs to be for the sake of being easily understood, and there is a much simpler way to achieve the task that would also be much more accurate. For the purposes of demonstration, however, it will suffice.
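The simpler version mentioned above can be sketched in a couple of lines. This is my compact stand-in, not the notebook’s longer original, and like the original it ignores leap years:

```python
import random

SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # ignores leap years

def getRandomDateShift(years: int) -> int:
    """Return a random shift, in seconds, between 0 and `years` years.

    Subtracting the result from the current time yields a date
    somewhere within the last `years` years.
    """
    return random.randint(0, years * SECONDS_PER_YEAR)

shift = getRandomDateShift(5)
print(shift)
```

A more accurate version would use `datetime` and `timedelta` arithmetic, but seconds are all the notebook needs.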
Finally, in terms of the data generation, here is the loop I used to generate a list of 1000 accounts. Each iteration generates a single account, itself a list, which is then appended to the accounts list. As can be seen, the “AccountID” ends up as a hash of the current time. The “AccountName” is John Smith. The “City” is randomly chosen from the cities list. The “OpeningDate” is the current time minus the number of seconds returned by the getRandomDateShift function, which spreads out the account opening dates. The “OpeningBalance” is a multiple of 50, from 50 to 500. The “CurrentBalance” is randomly assigned from -1,000 to 100,000; only accounts with overdrawal facilities can be set negative. And finally, “OverdrawalFacility” is a flag.
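The whole generation loop can be sketched as follows. The field order and ranges follow the description above; the exact code in the notebook may differ, and the three-city list stands in for the full cities.txt.

```python
import hashlib
import random
import time

column_names = ["AccountID", "AccountName", "City", "OpeningDate",
                "OpeningBalance", "CurrentBalance", "OverdrawalFacility"]

cities = ["Dublin", "Cork", "Galway"]  # stand-in for the cities.txt list

def getRandomDateShift(years):
    return random.randint(0, years * 365 * 24 * 60 * 60)

accounts = []
for _ in range(1000):
    overdraft = random.choice([True, False])
    # Only accounts with an overdrawal facility may go negative.
    low = -1_000 if overdraft else 0
    accounts.append([
        hashlib.md5(str(time.time()).encode()).hexdigest(),  # AccountID
        "John Smith",                                        # AccountName
        random.choice(cities),                               # City
        time.time() - getRandomDateShift(10),                # OpeningDate
        random.randrange(50, 550, 50),                       # OpeningBalance: 50..500 step 50
        random.randint(low, 100_000),                        # CurrentBalance
        overdraft,                                           # OverdrawalFacility
    ])

print(len(accounts))  # 1000
```

One caveat: in a tight loop, `time.time()` can return the same value twice, so the hash-based IDs are only unique “collisions permitting”, as noted earlier.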
You can see then that two of the accounts have their details printed, again as a type of sanity check at this stage.
Now we arrive at the purpose: conversion of this data into a DataFrame so that it can be used for analysis in Jupyter Notebook. This is done in the first line — blink and you miss it, that’s how straightforward it is. The next lines just show how some of the data-information properties work. You can see from “df.index” that there are 1000 records. “df.columns” shows the column names. And “df.dtypes” shows you what data types pandas believes the columns to be.
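A small sketch of that conversion and the inspection properties, using two dummy rows in place of the full 1000:

```python
import pandas as pd

column_names = ["AccountID", "AccountName", "City"]
accounts = [
    ["abc123", "John Smith", "Dublin"],
    ["def456", "John Smith", "Cork"],
]

# The one-liner: a list of row-lists plus column names becomes a DataFrame.
df = pd.DataFrame(accounts, columns=column_names)

print(len(df.index))    # row count
print(list(df.columns)) # column names
print(df.dtypes)        # inferred dtypes (object for these string columns)
```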
Finally, finally, we see how easy it can be to drill down into the data with a very simple line of code. The expression df["CurrentBalance"] < 0 generates a series of booleans which are True if the condition is met and False otherwise; in this case, True wherever the current balance on the account is negative. That boolean series is then used inside the square brackets of df to select only the accounts for which True was returned. So simple and yet so powerful.
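That boolean-mask selection, sketched with a tiny hand-made DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "AccountID": ["a1", "a2", "a3"],
    "CurrentBalance": [250, -120, 9_000],
})

mask = df["CurrentBalance"] < 0   # boolean Series: True where overdrawn
overdrawn = df[mask]              # keep only the rows where mask is True

print(overdrawn)  # just the a2 row
```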
A Word of Warning
As a lot of these values were generated using pseudorandom functions, the spread of the data is going to be relatively uniform. Therefore, this data will be relatively resistant to certain types of analysis, such as regression analysis or clustering. If you wanted to simulate such things you would have to build biases into your data-generating functions to force patterns to appear.
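As one hypothetical way of building in such a bias (this is not in the notebook): draw balances from a normal distribution rather than a uniform one, and shift the mean per city, so that clustering or regression has a pattern to find.

```python
import random

def biased_balance(city: str) -> int:
    """Hypothetical biased generator: Dublin accounts skew richer.

    random.gauss gives a non-uniform (bell-shaped) spread, and the
    per-city mean plants a recoverable pattern in the data.
    """
    mean = 60_000 if city == "Dublin" else 20_000
    return max(0, int(random.gauss(mean, 10_000)))
```

Swapping a call like this in for the uniform `random.randint` used for “CurrentBalance” would make the simulated data far more amenable to analysis.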
This data is simulated data, as a result only a simulated analysis will be possible.
A zip of the cities.txt and generatedata.ipynb files is available here.