Text Based Programming, One Line at a Time

This short course provides an introduction to text based programming, developed and executed (for the most part) one line at a time. The course is likely to be of interest to parents and teachers interested in supporting children's learning associated with the new computing elements of the National Curriculum.

The new computing curriculum in the UK for Key Stage 3 includes the following subject content requirement:

Use two or more programming languages, at least one of which is textual, to solve a variety of computational problems; make appropriate use of data structures [for example, lists, tables or arrays]; design and develop modular programs that use procedures or functions.
*[Computing in the national curriculum - A guide for secondary teachers](http://www.computingatschool.org.uk/data/uploads/cas_secondary.pdf)*

The course shows how text based programming can be used to execute operations more typically encountered in the context of spreadsheet programming, although presented in a way that allows them to be read - and debugged - in a far more literate way.

The intuitiveness of spreadsheet development hides to a large degree that it is actually a programming activity. Typing constant values into some cells and a formula into another cell is not seen as programming. It is rather comparable to using a pocket calculator. The immediate presentation of the result even supports this notion. This allows introducing novices without much ado. One learns to use an environment instead of learning a model. While this can be seen as base for the high popularity of spreadsheet systems, it hides the reality that spreadsheet developers are expressing themselves in an inherently functional formula language.
Hodnigg, K., Mittermeir, R., Clermont, M., Computational Models of Spreadsheet-Development Basis for Educational Approaches, Proc. EuSpRIG 2004 Conference, pp. 153-168, London: EuSpRIG

The method we will use takes a similar approach to immediately representing the results of a computation, as demonstrated below.

Before we start, however, we need to get some help. Computer programmers rarely write programmes from scratch. Instead, they build on programmes developed by other people, as well as re-using small programme fragments they have developed themselves.

One collection of programming tools we shall make heavy use of is called pandas. It allows us to work with a dataset loaded in from a spreadsheet file or a text based CSV file as if it were a big data table, running operations on it in a similar way to the way we can work with a spreadsheet.

In [3]:

#This is a code cell - we can write programme fragments in code cells and then execute them, one line at a time
#I'm going to load in the pandas programming library, and call it pd.
#Naming it pd is a convention, and, as you will see, makes working with the pandas tools more convenient, in typing terms!
import pandas as pd

When we run the above code cell, it doesn't have an output. If there is an output from the last command in a code cell, it will be displayed immediately below the code cell when we run the code cell - that is, when we execute the code in the code cell.

The following line loads in data from an Excel file (it actually skips the first few rows which don't contain any data).

Note: the following is not a very good example of a data file...

In [5]:

#In a code cell, lines starting with a # are comments that are not executed as code

#The pandas command read_excel will read in an Excel file, given the location and name of the file
#We tell the programme that read_excel can be found in the pandas library by prefixing it with the convenience name, pd
#We can skip a specified number of the leading rows in the file using the skiprows variable
df=pd.read_excel('data/HSCA Active Locations September.xlsx', skiprows=7)
#A variable is like a named container we can set to a particular value (and change the value of, if required)
#The = after the skiprows variable name says that the number 7 is the number of rows we want to skip
#More formally, it assigns the value 7 to the variable skiprows

#Variable assignments have the form:
# variable_name_on_the_left = value_the_variable_takes_on_the_right

#Note that we also assigned the output of the pd.read_excel() command to a variable.
#If we just put the name of a variable at the end of a code cell, its contents will be displayed
#In the following case, the [:3] after the variable name limits the rows that are displayed to the first 3 rows
df[:3]

Out[5]:

	Location ID	HSCA start date	Care home?	Location Name	Telephone Number	Registered manager (note; where there is more than one manager at a location, only one is included here for ease of presentation. The full list is available if required).	Web Address	Care homes beds	Location Type/Sector	Region	...	Service user band - Learning disabilities or autistic spectrum disorder	Service user band - Mental Health	Service user band - Older People	Service user band - People detained under the Mental Health Act	Service user band - People who misuse drugs and alcohol	Service user band - People with an eating disorder	Service user band - Physical Disability	Service user band - Sensory Impairment	Service user band - Whole Population	Service user band - Younger Adults
0	1-1000210669	2013-12-12	Y	Kingswood House Nursing Home	01424716303	Hogan, Deidre	NaN	22	Social Care Org	South East	...	NaN	Y	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Y
1	1-1000270393	2013-10-16	N	Red Kite Home Care	NaN	Hall, Pearl	NaN	0	Social Care Org	South East	...	NaN	Y	Y	NaN	NaN	NaN	Y	NaN	NaN	Y
2	1-1000312641	2013-10-18	N	Homecare Support Limited (Sale)	01619429490	Buckley, Michelle	www.homecaresupport.co.uk	0	Social Care Org	North West	...	NaN	Y	Y	NaN	Y	Y	Y	Y	NaN	Y

3 rows × 89 columns

The single code line, hidden amongst the comments in the previous code cell, loaded in some data from an Excel file and placed it in the variable df. The data file comes from the Care Quality Commission (CQC), and identifies all the locations inspected by the CQC.
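The Excel file itself is not included with this text, so as a rough sketch of what the skiprows setting does, the following cell uses a small made-up piece of CSV text instead. The file contents, the column values and the names raw_text and demo_df are all invented for illustration; the pandas read_csv command is used here because it accepts skiprows in the same way that read_excel does.

```python
import io
import pandas as pd

# Hypothetical file contents: two lines of notes, then the heading row, then the data
raw_text = """Notes about the file
Another line of notes
Location Name,Care home?,Care homes beds
Home A,Y,22
Agency B,N,0
Home C,Y,30
Agency D,N,0
"""

# io.StringIO lets us treat the text as if it were a file
# Skipping the 2 leading note lines means the third line is used for the column headings
demo_df = pd.read_csv(io.StringIO(raw_text), skiprows=2)

# Display just the first 3 rows of the table
demo_df[:3]
```

Without skiprows=2, pandas would try to use the first line of notes as the column headings, which is why the real data file needed its leading rows skipping too.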

Note the column "Care homes beds", which contains the number of beds in locations that are care homes.

We can find the total number of care home beds with a single command.

In [6]:

#Specify a column name, and the operation we want to apply to it, in this case "sum"
df['Care homes beds'].sum()

Out[6]:

463812.0

We can also find the average - or mean - number of beds, which is the total number of beds divided by the number of rows that have a value in that column (pandas leaves out rows where the value is missing):

In [7]:

df['Care homes beds'].mean()

Out[7]:

9.3181717729784026

Does that calculation make sense though? The CQC inspects lots of locations that aren't care homes - so this average is not calculated over the total number of care homes at all. It's calculated over the total number of locations the CQC inspects.
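To check our understanding of what the mean is being calculated over, here is a minimal sketch using a tiny made-up table (the values and the name toy are invented, not taken from the CQC data). It assumes one of the locations has no value recorded for its number of beds, which shows up in pandas as NaN.

```python
import numpy as np
import pandas as pd

# A made-up table with four locations; the last one has a missing bed count
toy = pd.DataFrame({
    'Care home?': ['Y', 'N', 'Y', 'N'],
    'Care homes beds': [20, 0, 40, np.nan],
})

beds = toy['Care homes beds']

total_beds = beds.sum()      # missing values are ignored, so this is 20 + 0 + 40
rows = len(toy)              # counts every row, including the one with the missing value
non_missing = beds.count()   # counts only the rows that actually have a value
mean_beds = beds.mean()      # the total divided by the number of non-missing values

print(total_beds, rows, non_missing, mean_beds)
```

Here the mean is 60 divided by 3, not 60 divided by 4, and the locations that are not care homes still count towards it.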

In [21]:

#Number of locations - the len() command counts the number of rows (that is, the length) of the data table
len(df)

Out[21]:

The "Care home?" column identifies whether a location is a care home (Y) or not a care home (N).

We can generate a subset of the data that just contains the care homes by filtering the original dataset in the following way:

In [22]:

#Create a new variable, called careHomes, that just contains rows where the "Care home?" column value is Y
careHomes=df[ df['Care home?']=="Y" ]
len(careHomes)

Out[22]:

In [23]:

#Now find the mean number of care home beds in locations that are care homes
careHomes['Care homes beds'].mean()

Out[23]:

26.914176289676782
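The filtering step can be broken down into two smaller steps, as in the following sketch. It uses a small made-up table rather than the CQC data, and the names toy, is_care_home and toy_care_homes are invented for illustration.

```python
import pandas as pd

# A made-up table with two care homes and two other locations
toy = pd.DataFrame({
    'Location Name': ['Home A', 'Agency B', 'Home C', 'Agency D'],
    'Care home?': ['Y', 'N', 'Y', 'N'],
    'Care homes beds': [20, 0, 40, 0],
})

# Step 1: test every row - the result is a column of True/False values
is_care_home = toy['Care home?'] == 'Y'

# Step 2: keep only the rows where the test came out True
toy_care_homes = toy[is_care_home]

# Compare the mean over all locations with the mean over care homes only
overall_mean = toy['Care homes beds'].mean()
care_home_mean = toy_care_homes['Care homes beds'].mean()

print(overall_mean, care_home_mean)
```

In this toy table the mean over all four locations is 15 beds, whereas the mean over the two care homes is 30 beds, which mirrors the difference we saw in the real data.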

Some better examples, with one-line graphics, can be found here: ggplot examples.

A Chrome browser extension runs IPython notebooks - though not currently with support for ggplot: Working With Data in the Browser Using Python – coLaboratory.

IPython notebooks can also be run as a hosted, cloud based app, or as per TM351, in a virtual machine on a host desktop.