Python Distributions Package, Testing and Github

Introduction

So, I’m starting from the end on this one, I have the work done, but I’m going to document some important elements of how to go about creating a Python package, how to run some tests to ensure that it works and how to store it on Github. First I’m going to start with Github, then I’ll work on some testing and finally we’ll discuss the code.

The above are the general steps that should be taken before undertaking any development, the best practice steps. But I took them out of order. I had the code written, which I then verified with some tests and now I’m going to create a repository on Github to keep them safe, the opposite of best practice.

So for the purpose of clarity I’ll list the steps in order of when they should be done. Bear in mind that some of the Git steps aren’t once off steps. You should commit often and ideally each commit should relate to a discrete piece of functionality. You should also push to the remote repository regularly, but less regularly than you commit to your local repository.

The Steps:
  1. Create local Git repository
  2. Add files to repository and make your first commit
  3. Create remote repository on Github or Bitbucket or whatever you use
  4. Connect the local repository to the created remote repository
  5. Write some tests based on how your code should perform
  6. Commit
  7. Write some code to pass the tests
  8. Commit
  9. Repeat steps 5 to 8 until your code does everything you need it to
  10. Push to your remote master repository (this step to be used often as well)

Some caveats before we go any further. This is far from a full Git lesson. Git is a complex system which, when used properly, can save your bacon. Go learn it properly if you see the benefits; a decent introductory course is available on Udacity. This post introduces the concept of “test-driven development”, a style of programming where the tests are written before any of the code. The basic premise of this style of programming is that any sufficiently complex program is going to need to be tested, and these tests are more cost-effective if they are written from the outset. This post is also far from a full test-driven development lesson; again, it’s too complex to fully cover in a blog post. The Wikipedia article linked to above should provide a good jumping off point if you’re interested in moving further into that arena. Final caveat: this article introduces some concepts around object oriented programming and, again, it’s too complex to be fully covered here. Think of this article as a taster menu: if you like an item, buy in to the full thing…

The Plan

The title gives a clue, but since I haven’t fully laid out what the plan is with this post let me do it here. The Udacity course I mentioned in my introductory post (program website) goes into a bit of detail in relation to probability distributions. It also briefly covers all of the topics I will be discussing, even more briefly, here.

On the topic of object oriented programming there is a small project to build up a Python package to work with Gaussian and Binomial Distributions. It starts with the Gaussian code. Then it progresses to create more general distribution code which gets used by the Gaussian code. Then it adds some Binomial code which also uses the more general distribution code.

What I’m going to do here is briefly outline the various steps and bring us up to the point where we have a (enter your choice of version control here, surprise surprise mine will be Github) repository containing the package we created.

Warning: The code below is not great, it may even be buggy, it is unlikely that you will ever really seriously use it. I’m making this post to document the steps I took so I can follow in my own footsteps in the future, but also to outline the general process of best practice repository creation and test driven development. There exists far better and more complete code available to work with distributions through some of the more popular mathematical Python packages, for example see SciPy’s statistics package. All that being said, let’s go!

Git Repository and Remote Github Origin

First of all, what is Git and is it any different to Github? To answer this let me pose to you some questions. Let’s say you want to make a piece of software as complicated as the browser you’re using to look at this web page. I’m assuming you’re using something like Chrome/Edge/Firefox/Safari, but you could be using something as basic as Lynx, a text only terminal-based browser. Even Lynx would have a substantial amount of code to run the text only browser. The question is this: How do you manage the source code in a way that lets you trace when and by whom changes were made to the code? If you realise that a change you made caused an error, how do you change back to a previous version?

The easiest answer to this is: Version Control. Version control lets you manage the code used to build a piece of software, and it even allows you to control who gets to contribute code to the software. Git is one version control solution; there are others but they’re beyond the scope of this article. Version control systems allow you to create multiple versions, or branches, of the same software to add different features into each separate branch. Eventually these different branches get merged into the main branch, in an ideal world. Sometimes branches get abandoned and removed. To answer the question above (is Git different to Github), yes they are different, read on to find out how.

So, “what is Github?” I hear you ask. Git runs on a local machine. Each independent contributor will have their own complete version of the software code. They use Git to manage their own code. How do they collaborate? Enter Github! Github is an online repository storage that allows for management of collaborators to a project. Git allows for all contributors to a project to push their code to Github, at which time a senior person in the project will have responsibility as to how and when these pieces of pushed code get merged into the main source code. “Putting your existing work on GitHub can let you share and collaborate in lots of great ways.” [1]

So in terms of how to start with Git and Github (as mentioned previously bitbucket is an alternative to Github, and there are more, you don’t have to use Github, but I am). First, I would recommend you set up a Github account, it will save time. This guide gives a nice summary of how to initialise a local Git repository and then to push it to an online repository through your Github account.

In order to get this done you have to create a Git repository on the machine you’re using, this is the local repository. You also have to create a repository on your Github account, this will be the remote repository. When creating a local repository, you need to have a file to add to the repository in order to make a commit, traditionally that is the README.md file. Here is an example of what the Git commands look like when you create the local repository:

Figure 1: Local Git repository being created

So now we create the remote repository, which we’ll call the same name as the directory we’re working out of locally, although you don’t have to. The steps:

Figure 2: Click “New Repository” button on your Github account
Figure 3: Add some repository details

The repository name is the only mandatory input on this page. You can choose whether or not to make the code public or private. In the guide linked to earlier, they recommend to avoid adding a README, gitignore or a licence at this stage, so just click “Create repository” when you’ve entered a name and an optional description. The page that comes up next pretty much explains the next steps to be conducted on the local repository to connect it to the remote. All you need to know is that you need the address for the remote repository, this will be near the top and if you click on the clipboard to the right of the text box containing the address it will copy the address for you:

Figure 3: Remote repository address, click clipboard to copy

Next you tell the local Git repository to treat the Github repository as the remote origin. This means that the local repository can push code to and pull code from the Github repository. Then you can make your first push, it should all look like this:

Figure 4: Configuring remote origin and making first push

I won’t bring this any further except to remind you that you should make regular commits when you change any files. Again, this is not a full git lesson, there are umpteen great sources to learn from, I’ve linked to one earlier. This is probably not even a good Git primer, but I did try.

Test Driven Development

This section of the article is a biggie. It’s a huge concept. It’s a huge return on investment. In saying that I’m not going to try to sell you on it, there are plenty of resources out there already taking care of that. What I will say is that if you’re starting out in development I strongly recommend you take the time, no matter what language you use, to figure out how to get started with Test Driven Development.

Although it’s a huge concept, it’s very easy to explain the basis of it. The purest version of it is that before you write any code to do anything, you write the tests first. I hear you say: but how can I write tests for code that doesn’t exist yet? Well for a beginner programmer that’s a fair question. If you have a bit of experience under the belt, you shouldn’t be asking that question. However, for the benefit of the beginner I’ll answer it anyway. Test Driven Development is something that would be difficult to do until you get to a certain point in the journey of learning to program. You really need to get to a point where you start creating functions/methods which start to do some work. By the time you’re at that point, before you write a function/method you know that for a certain input that you will get a certain output. Therefore, you can write a piece of code to call the function with known input/output combinations. This code is the test code. If you write this test code before you write the function you’re on the path towards Test Driven Development.
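To make that a bit more concrete, here is a minimal sketch of the order of events. The function name calculate_mean is just a placeholder I made up for illustration, it isn’t taken from the course code. The test is written first, from input/output pairs I already know, and the function is only written afterwards to satisfy it.

```python
# Written FIRST: the test pins down known input/output combinations.
def test_calculate_mean():
    assert calculate_mean([1, 2, 3]) == 2
    assert calculate_mean([10]) == 10


# Written SECOND: just enough code to make the test above pass.
# calculate_mean is a hypothetical example function.
def calculate_mean(values):
    return sum(values) / len(values)
```

Until the function exists, running the test fails, and that failure is the prompt to go and write the code.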

Now you’re not alone on this path, for every language there exist testing tools and frameworks that you can use to undertake this journey. They are usually very well documented and can be set up very easily. In the Udacity course the language we used was Python, and the testing tool used was very trivial to set up. The tool is called pytest. See the getting started guide for steps to set it up.

$ pip install -U pytest
$ pytest --version

Enter the two commands above and you should have installed the pytest tool. If you don’t have pip installed, then you won’t, look that up yourself: duckduckgo search.

This is where the magic happens. To get going, as explained in the getting started guide, all you have to do is create a file that matches either “*_test.py” or “test_*.py”, run pytest and it will run the tests found inside any files matching those patterns. Let’s create some dummy test file to test pytest, we’ll call the file test_distributions.py because we will later expand it to include tests of the code we’ll be writing later. The first test is as follows:

def test_pytest():
    assert True

And when you run pytest you should get something similar to below:

Figure 5: pytest output for our first test

You are unlikely to find an easier way to set up Test Driven Development, there may be tools and frameworks just as easy, but easier? I doubt it.

Gaussian Distribution Code

I’m moving on now to the Gaussian Distribution code. I won’t be detailing all the code here. I will, however, have some snippets here, the starting code and some sample tests. If you want to see the full code you can visit the Github repository.

But first we have to write our tests. Before we write our tests we have to at least have an idea of how we intend to implement our code. We will first be creating a class called Gaussian. In the next section we’ll cover higher level Object Oriented topics, but for now we’ll concentrate on just one class.

The first decision to make is, what data will the class store? And to answer we have to think about what characteristics a Gaussian Distribution has. It should be able to store the list of measurement data. From the data measurements you can calculate the mean and the standard deviation. So our class will have three member variables, a data list, a mean and a standard deviation.

It’s important to note here that we’re going for a quick and dirty approach, although demonstrating elements of best practice, we’ll stop short of validation. What I mean by that is that we won’t be ensuring that the data list contains only valid numbers, we will just be assuming it does. We’re not trying to create a perfect solution, because great solutions already exist. We’re just learning about the process.

Now we have to decide how to initialise a new instance of the class. I’m going for an empty data list, with the mean defaulting to 0 and the standard deviation defaulting to 1. Now that we know what our code is expected to do we can write some tests. My initial tests are below:

from Gaussian import Gaussian

def test_pytest():
    assert True

def test_Gaussian_default_values():
    myGauss = Gaussian()
    assert myGauss.mean == 0
    assert myGauss.stdev == 1
    assert len(myGauss.data) == 0

def test_Gaussian_specified_values():
    myGauss = Gaussian(mean=2, stdev=3)
    assert myGauss.mean == 2
    assert myGauss.stdev == 3
    assert len(myGauss.data) == 0

We start by importing the Gaussian class from the Gaussian file. In an attempt to stop this import from throwing an error, I created an empty Gaussian.py file. The tests will still fail hard, because we have written no code yet.

Figure 6: Tests exploded!

The exception was thrown anyway. This is because there is no Gaussian class to import yet, so pytest cannot even load the test file. Let’s add a single line to the Gaussian.py file, the class definition, to see what happens:

class Gaussian():

Figure 7: New Errors!

Believe it or not, these failures are Test Driven Development wins! As we get closer to the solution more and more of the tests will pass. We have beaten about this bush enough though. The first set of errors (Figure 6) were because the class didn’t exist. The second set (Figure 7) is because the class is empty, let’s add our constructor (__init__ in Python) to see if we can pass these tests.

class Gaussian():

    def __init__(self, mean=0, stdev=1):
        self.mean = mean
        self.stdev = stdev
        self.data = []

Constructor added, let’s try our tests again.

Figure 8: Tests passed

Our tests were passed with the addition of the constructor, happy days!

Right I’m going to add new tests from the course and the full Gaussian code to pass them. The tests will look different from above on Github, like I said we’re doing quick and dirty, the course did them more correctly. All the code will be available on Github for you to pore over.

You should have a flavour now of how a simple class definition works in Python, but more importantly how it is possible to write tests before any code. It becomes an arms race between your tests and your code. What you should really be aiming for with Test Driven Development is complete code coverage and testing edge cases. Complete code coverage means your complete set of tests should execute every line of code at least once. If a line is missed out, it doesn’t necessarily mean that it is or isn’t working, it just means that it wasn’t tested. In making sure it is covered as part of a test, you can at least say that it works in certain circumstances. With the edge cases, you are actually taking an adversarial position against your code. You’re trying to break it with your tests, so that you have to fix the code to address the breakage. Fully tested code does not ensure that your code is bug free, it just ensures that it works for the test cases you thought to test.
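As a rough illustration of the difference between a typical case and an edge case, here is a small sketch. The calculate_stdev helper and its sample parameter are my own assumptions for the example, not the course code. The first test checks a known answer, the second deliberately tries to break the function with an empty list.

```python
import math


def calculate_stdev(data, sample=True):
    """Hypothetical helper: standard deviation of a list of numbers."""
    n = len(data) - 1 if sample else len(data)
    if n <= 0:
        # Edge case: no (or too little) data to divide by.
        raise ValueError("not enough data points")
    mean = sum(data) / len(data)
    return math.sqrt(sum((x - mean) ** 2 for x in data) / n)


def test_stdev_typical_case():
    # Population standard deviation of this list is exactly 2.
    data = [2, 4, 4, 4, 5, 5, 7, 9]
    assert math.isclose(calculate_stdev(data, sample=False), 2.0)


def test_stdev_empty_data():
    # Adversarial test: an empty list must raise, not divide by zero.
    try:
        calculate_stdev([])
    except ValueError:
        return
    raise AssertionError("expected a ValueError for empty data")
```

With pytest installed you would more likely use pytest.raises for the second test; I’ve used a plain try/except here so the sketch runs without it.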

We’re next going to discuss some Object Oriented concepts, which will lead to the breaking out of some code from Gaussian into a more general Distributions class and the reuse of that code in a Binomial class. Doesn’t that sound like fun!

Object Oriented Development

There are entire modules in college courses devoted to the topic of Object Oriented Development or Object Oriented Programming (OOP). Countless books exist detailing the topic. Suffice to say, this blog post will not be going into so much detail. Wikipedia is yet again a good jumping off point on the topic of OOP, if you want to go deeper than the discussion here.

What I do want to briefly discuss here are the “Principles of Object Oriented Programming” (introduced to me as the pillars of OOP) and then focus in specifically on one of the principles.

I was also going to discuss a bit of history in relation to the evolution of programming languages and how the computer science came to adopt OOP. However, from a scan of the history section of the Wikipedia article it’s apparent that it’s way more complex and dates much further back than I would be able to do any justice to in a couple of paragraphs.

Before I give my half baked explanations of the principles let me first direct you to an article on the freecodecamp.org site that explains them more thoroughly. This article points out something that I also wanted to highlight for anyone who might not have been aware. The principles of OOP are an often asked interview question for budding professional coders. Indeed, when I heard about them first it was from a lecturer who stated that she felt she had gotten her first role as a programmer in no small part because she had a solid understanding of the principles. In effect, they are important principles, and if you have intentions to work as a developer, make sure you can explain them.

There are four principles to be discussed; 1. Encapsulation, 2. Abstraction, 3. Polymorphism and 4. Inheritance.

But first, what are objects? Objects are data structures used when programming. They can be thought of as a way of representing real world objects, although they are not limited to that use. Another term that is often used interchangeably with object is class, although strictly speaking a class is the blueprint and an object is an instance created from it. We have already looked at a part of a class we called Gaussian. In this article we looked at the constructor which set a number of values that were part of the class e.g. self.mean. Objects have values associated with them, these are commonly referred to as member or instance variables, there are other terms used. Objects also have functions or actions that they can perform on themselves or that can be called by other parts of a program, these functions are usually referred to as methods. If we take an example of an employee in Figure 9 below it has member variables of employee_number, name, salary and work_tasks, a list of work assigned to the employee. Next comes the list of methods, beginning with the constructor method Employee(int, string, float) which will take the values that the member variables are to be initialised as. There are methods to get and set each of the member variables, although how often you’ll be changing an Employee’s name and number is close to never, but bear with me for the sake of the example. Finally there are the methods of Give_Bonus(float), Assign_Work(Work), Assign_Work(Work [ ]) and Do_Work(Work).

Figure 9: Employee Class

Encapsulation relates to a kind of wrapper that goes around an object. It wraps around the inner state and methods of an instance of a class. It ensures that the inner state is kept private from other instances, instances of other classes or any other part of the program that you don’t want to have access. In the Employee example the salary of each employee can only be read by using the get method. Likewise, salary is protected from being changed directly; it can only be changed through the set method. The pluses and minuses beside the variable and method names indicate whether they are publicly (+) or privately (-) accessible. It is through that behaviour that the internal state is kept private, and only by utilising the public methods can you see or change the internal state. For example the public method set_salary() will change the employee’s salary.
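Here is a rough Python sketch of that idea, based on the Employee example. The validation inside set_salary is my own addition to show why routing changes through a method is useful, and the leading underscores are only Python’s way of hinting that the variables are internal.

```python
class Employee:
    def __init__(self, employee_number, name, salary):
        # Leading underscore: internal state, not meant to be touched directly.
        self._employee_number = employee_number
        self._name = name
        self._salary = salary

    def get_salary(self):
        """Public way to read the salary."""
        return self._salary

    def set_salary(self, salary):
        """Public way to change the salary (the check is an assumed rule)."""
        if salary < 0:
            raise ValueError("salary cannot be negative")
        self._salary = salary
```

Calling code would use employee.get_salary() and employee.set_salary(32000.0) rather than reaching for the variable itself.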

Abstraction is an extension, of sorts, to encapsulation. It relates to the way that the object carries out its functionality. All that is surfaced to a part of the program that calls an instance to do a piece of work are the public methods. The way that the method carries out the work is hidden away from the calling code in what is called abstraction. With the Employee a part of the program can call Assign_Work but that calling code won’t know how the Employee class records the work or how it performs the work, because both the work_tasks list and the Do_Work method are private.

To take abstraction a step further, in a later iteration of the code the way the work is stored or the way the Employee does work can be completely changed, but as far as the calling code is concerned it works exactly the same because it just calls the same old Assign_Work method.
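A small sketch of that point is below. The two class names and the do_next method are hypothetical, made up purely to show two iterations side by side. The internals change from a list to a deque, but the calling code is identical for both.

```python
from collections import deque


class EmployeeV1:
    """First iteration: work is kept in a plain list."""

    def __init__(self):
        self._work_tasks = []

    def assign_work(self, work):
        self._work_tasks.append(work)

    def do_next(self):
        return self._work_tasks.pop(0)


class EmployeeV2:
    """Later iteration: same public methods, different internals."""

    def __init__(self):
        self._work_tasks = deque()

    def assign_work(self, work):
        self._work_tasks.append(work)

    def do_next(self):
        return self._work_tasks.popleft()


def hand_out_work(employee):
    """Calling code: only ever uses the public methods."""
    employee.assign_work("write report")
    employee.assign_work("file expenses")
    return employee.do_next()
```

hand_out_work neither knows nor cares which version it was given, which is the whole point of abstraction.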

Inheritance is the ability to use an already existing class as a base class for further classes. A base class that gets inherited is often called the super or parent class. The class that does the inheriting is called a sub or child class. In an inheritance scenario a child class will have all the attributes and methods of the parent class, but it can extend its features by adding more attributes and methods. With our Employee example we might want to write a Manager class, we can inherit the Employee class because a Manager will still need an employee_number, a name and a salary. A manager will also still have work_tasks, however, a manager will delegate some of the work to Employees.
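In Python the Employee/Manager relationship might be sketched as follows. The team attribute and add_report method are my own assumptions about what a Manager would add, they aren’t taken from the diagrams.

```python
class Employee:
    def __init__(self, employee_number, name, salary):
        self.employee_number = employee_number
        self.name = name
        self.salary = salary
        self.work_tasks = []

    def give_bonus(self, amount):
        self.salary += amount


class Manager(Employee):  # Manager inherits from Employee
    def __init__(self, employee_number, name, salary):
        # Reuse the parent's constructor for the shared attributes.
        super().__init__(employee_number, name, salary)
        self.team = []  # extra attribute only a Manager has (assumed)

    def add_report(self, employee):
        """Extra method only a Manager has (assumed)."""
        self.team.append(employee)
```

A Manager created this way can call give_bonus even though the Manager class never defines it, because the method comes along from Employee.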

Polymorphism is probably the hardest concept to get of the four principles, partly because when you first come across the term, it’s likely the first time you’ve ever heard it and your brain is too busy wrestling with the word to actually concentrate on the concept. The basic feature of Polymorphism is that it is a way of having more than one method with the same name which behave differently. When you think of the Employee/Manager scenario when the Boss calls the Manager’s Assign_Work method, the Manager will not behave the same way as the Employee. For example, in reality a Manager will assess the work and see if it is Manager worthy or Employee worthy and if Employee worthy will delegate it to someone else. For this to be able to happen a Manager class will need to have its own version of the Assign_Work method. When a child class has its own version of a method from a parent class, it is called overriding. The Child class overrides the functionality of the parent class. Overriding of methods is also called dynamic polymorphism.
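A minimal sketch of overriding, under some assumptions of my own: the Work class, its needs_manager flag and the rule of delegating to the least busy report are all invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Work:
    description: str
    needs_manager: bool = False  # hypothetical "Manager worthy" flag


class Employee:
    def __init__(self, name):
        self.name = name
        self.work_tasks = []

    def assign_work(self, work):
        self.work_tasks.append(work)


class Manager(Employee):
    def __init__(self, name, reports):
        super().__init__(name)
        self.reports = reports

    def assign_work(self, work):
        """Overrides Employee.assign_work: keep it or delegate it."""
        if work.needs_manager or not self.reports:
            super().assign_work(work)
        else:
            # Delegate to whichever report currently has the least work.
            least_busy = min(self.reports, key=lambda e: len(e.work_tasks))
            least_busy.assign_work(work)
```

The caller uses the same assign_work call on an Employee or a Manager, and which version runs is decided by the type of the object at run time.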

There is also a static polymorphism. This is when a class can have two methods with the same name which are distinguished from each other by the number and type of parameters passed to the method when calling it. This is also called overloading. For example in the Employee class we have two versions of the Assign_Work method which allows for an employee to be assigned only one task, or a list of tasks. Depending on the work being assigned by the calling code (single task or list), one or other of these methods will be utilised.
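Python doesn’t support overloading in the way languages like Java do: a second method with the same name simply replaces the first. One way to approximate the two Assign_Work versions is functools.singledispatchmethod (Python 3.8 or later), which picks an implementation based on the type of the first argument. This is a sketch of one possible approach, and the choice happens at run time rather than compile time, so it isn’t strictly static.

```python
from functools import singledispatchmethod


class Employee:
    def __init__(self):
        self.work_tasks = []

    @singledispatchmethod
    def assign_work(self, work):
        """Default version: a single task."""
        self.work_tasks.append(work)

    @assign_work.register(list)
    def _(self, work):
        """Version chosen when a list of tasks is passed."""
        self.work_tasks.extend(work)
```

Calling employee.assign_work("one task") uses the first version, while employee.assign_work(["two", "tasks"]) uses the second.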

Some languages implement these principles more formally than others. For example Python doesn’t do encapsulation very well because it doesn’t have truly private variables. The inner state of an instance of a class is an open book. You can prefix a name with an underscore to signal that it is meant to be internal, but that is only a convention and nothing stops other code from accessing it.
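A quick sketch of what the underscores do and don’t do. The attribute names are made up for the example. A single underscore changes nothing about access, and a double underscore only renames the attribute behind the scenes (name mangling), it doesn’t lock it away.

```python
class Employee:
    def __init__(self, salary):
        self._salary = salary  # single underscore: convention only
        self.__bonus = 0       # double underscore: stored as _Employee__bonus


employee = Employee(30000)

# The "internal" variable is still readable from outside the class.
visible_salary = employee._salary

# The double-underscore name is not found under its original spelling...
has_plain_name = hasattr(employee, "__bonus")

# ...but it is still reachable under its mangled name.
mangled_bonus = employee._Employee__bonus
```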

Inheritance

For the purposes of the overall article we are going to focus on Inheritance. The reason for this is that if you go to the Github repository begun as part of this post right now, you will see a set of classes: a parent Distribution class and two child classes, Gaussian and Binomial. Since we’re not discussing every implementation step through the post, I wanted to at least explain the purpose of inheritance a bit more than above.

We described earlier what inheritance is in relation to OOP. Let us now discuss the purpose of inheritance. Inheritance allows for reducing the amount of code in a program by reusing the shared code and logic from a parent class that will be common across child classes. In the employee/manager example above this code reuse was limited. Inheritance can be more complex than the employee/manager example. I’ll show a couple of diagrams to illustrate the concept starting with the Employee/Manager class diagram.

Figure 10: UML Inheritance Diagram Employee/Manager

With Figure 10 the arrow pointing at Employee means that it is the class being inherited from. This might be confusing since it looks like the Employee class is higher than the Manager class, but this is just the convention for diagramming classes in UML. The unfilled arrow notation signifies inheritance and can be read as “is a”, i.e. a Manager is an Employee. But it also means extends, so the Manager class will extend the attributes and functionality of the Employee class. The Manager class doesn’t list the attributes and methods of the Employee class because it doesn’t need to, by way of inheritance it is automatically given these attributes and methods. As you can see the only attributes and methods listed for Manager are types of attributes and methods that a Manager would have that an ordinary Employee would not. Rather than re-invent the wheel for the more complex class diagram I am going to use an existing Shape based class diagram next.

Figure 11: Shape Class Diagram[2]

As you can see from Figure 11 by the time you get down to a Square or a specific Triangle type you will have already inherited quite a bit of code. To put that into context for our overall blog post point, let’s see what a class diagram for the finished distributions classes will look like.

Figure 12: Class Diagram for the Distributions covered in this blog post

The diagram in Figure 12 is based on the code that is presented in the Udacity course. I would have designed it slightly differently, but it satisfies the requirements as it is. There are a number of methods that are common between the two child distributions, but because they operate in completely different ways the methods themselves could not be implemented in the parent class. However, to somewhat solve this, there is a facility to make a class an abstract class. An abstract class can’t be instantiated, and since the Distribution class in this example never is, it could be turned into an abstract class. The reason abstract classes can’t be instantiated is that they usually contain one or more abstract methods. Abstract methods are methods that are declared in the abstract class but whose functionality is not defined there. Any class that inherits from an abstract class must define a method of the same name in its own class definition before it can be instantiated. This would have allowed the shared method names to be declared once in the parent class, so that the class diagram would have looked cleaner. More on abstract classes in Python can be found in the abstract base class documentation; it’s beyond the scope of this blog post to cover any further.
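For a flavour of what that could look like, here is a sketch using Python’s abc module. This is how I might have done it, not how the course code does it, and the method and parameter names (calculate_mean, prob, size) are assumptions for the example.

```python
import math
from abc import ABC, abstractmethod


class Distribution(ABC):
    def __init__(self, mean=0, stdev=1):
        self.mean = mean
        self.stdev = stdev
        self.data = []

    @abstractmethod
    def calculate_mean(self):
        """Declared here, but each child must supply its own version."""


class Gaussian(Distribution):
    def calculate_mean(self):
        # Mean of the measured data.
        self.mean = sum(self.data) / len(self.data)
        return self.mean


class Binomial(Distribution):
    def __init__(self, prob=0.5, size=20):
        self.p = prob
        self.n = size
        super().__init__(self.calculate_mean(),
                         math.sqrt(size * prob * (1 - prob)))

    def calculate_mean(self):
        # Mean of a binomial distribution is n * p.
        self.mean = self.p * self.n
        return self.mean
```

Trying to create a plain Distribution() now raises a TypeError, while Gaussian and Binomial can be created as normal because they each define calculate_mean.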

I’m not sure if that whirlwind explanation sold you on the power of inheritance, but that was its intention. Right now we’re in the position of having our general Distribution class and our child Gaussian and Binomial classes. But in order to be of maximum benefit we need to turn these modularised code files into a Python package.

Making a Package out of a Module

Currently, we have our Python files in an unpackaged state. You can see how the directory containing the code looks in Figure 13 below. Essentially the Gaussian, Binomial and Distribution code files, the tests and the data files are all gathered together in one directory. What we really want, to make this useful, is to turn the Python code files into a package. You can find more information on Python packages in the documentation.

Figure 13: Current layout of Distributions code

So what we’re going to do is move the Python files into a sub-directory called distributions. In order for Python to treat the distributions directory as a package we also need to add an __init__.py file to the distributions directory. Our layout changes to what’s shown in Figure 14; I have excluded the __pycache__ directories.

Figure 14: Packagised distribution code

Now if you try to run pytest again, it will no longer work, so we have a few more steps to get this set up correctly. First of all in the “test_distributions.py” file, we need to change the imports to refer to the distributions directory. They should be changed to:

from distributions import Gaussian
from distributions import Binomial

Next, in the “distributions/Gaussian.py” and “distributions/Binomial.py” in order for them to be able to find “distributions/Distribution.py” you must add a . in front of the first Distribution reference in that import line. So the relevant import lines in both files are currently:

from Distribution import Distribution

Which needs to be changed to:

from .Distribution import Distribution

If you run pytest now, you’ll still get failed tests, but it looks less scary. Finally we need to amend the “distributions/__init__.py” file to make the package contents visible to outside references. If we add the following lines:

from .Gaussian import Gaussian
from .Binomial import Binomial

We can now run pytest successfully. This means our package is up and running.
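If you want to see the moving parts without touching your own files, the sketch below builds a cut-down copy of the layout in a temporary directory and imports it. The file contents are stripped-down stand-ins I wrote for the example (Binomial is left out to keep it short), not the real code from the repository.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Stand-in file contents, keyed by path relative to the project root.
FILES = {
    "distributions/__init__.py": "from .Gaussian import Gaussian\n",
    "distributions/Distribution.py": (
        "class Distribution:\n"
        "    def __init__(self, mean=0, stdev=1):\n"
        "        self.mean = mean\n"
        "        self.stdev = stdev\n"
        "        self.data = []\n"
    ),
    "distributions/Gaussian.py": (
        "from .Distribution import Distribution\n"
        "\n"
        "class Gaussian(Distribution):\n"
        "    pass\n"
    ),
}


def build_package(root):
    """Write the package files underneath the given root directory."""
    for relative_path, source in FILES.items():
        target = Path(root) / relative_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(source)


project_root = tempfile.mkdtemp()  # throwaway directory, not cleaned up here
build_package(project_root)
sys.path.insert(0, project_root)   # so Python can find the new package
importlib.invalidate_caches()

distributions = importlib.import_module("distributions")
my_gauss = distributions.Gaussian(mean=2, stdev=3)
```

The relative import in Gaussian.py and the line in __init__.py are the same two changes described above; remove either one and the import at the bottom stops working.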

As an additional step, you could add a setup.py file in the directory above distributions with the following contents:

from setuptools import setup

setup(name='distributions',
      version='0.1',
      description='Gaussian distributions',
      packages=['distributions'],
      zip_safe=False)

Warning: If you are not careful with pip install, you can negatively impact upon already installed Python packages. While I’m not aware of whether or not there is a standard pip installable “distributions” package, you should definitely check before pip installing any handcrafted Python packages.

Adding the setup.py file makes your package installable using pip: “pip install .”. You should have the following directory layout after that:

Figure 15: Directory layout with pip installable distributions package

I will add all of these files and push them to the python distributions Github repository for you to play with at your leisure.

Conclusion

I’m going to be honest here. This post took much longer to put together than expected. I’m glad I created it, and I will definitely come back to it as a memory aid for myself at some point in the future. Other than that, if even one section helps someone to understand a topic more fully it’ll be even more worthwhile.

Quite a bit of ground was covered and hopefully I have sparked an interest in some of the topics covered. Most importantly for any budding coders out there “Test Driven Development” seems to be an ever more important paradigm, and if you can drive some of your effort in that direction, it will not be wasted.

Sources

  1. https://docs.github.com/en/github/importing-your-projects-to-github/adding-an-existing-project-to-github-using-the-command-line
  2. Diagram taken from: Venkitapathy, Krishnapriya & Krishnathevar, Ramar. (2010). Comparison of Class Inheritance and Interface Usage in Object Oriented Programming through Complexity Measures. International Journal of Computer Science & Information Technology. 2. 10.5121/ijcsit.2010.2603. Licence: Creative Commons Attribution 4.0 International
