A data structure consisting of an ordered collection of items of a single type, i.e. an indexed list.
Also known as a Cartesian coordinate system, which plots pairs of numbers as points on a plane using an x axis and a y axis.
A machine-learning algorithm that determines the class of an input element based on a set of features.
Where the program has to make a decision based on a series of options, using conditional statements such as if, elif and else.
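As an illustration of the definition above, here is a minimal sketch of a decision made with if, elif and else. The function name describe_length and the length thresholds are hypothetical, chosen only for this example.

```python
# Hypothetical helper: label a word by its length.
def describe_length(word):
    if len(word) < 4:
        return "short"
    elif len(word) < 8:
        return "medium"
    else:
        return "long"

label = describe_length("whale")
print(label)  # medium
```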
A character (most typically a comma) used to specify boundaries between words or regions in plain text.
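A short sketch of a comma acting as the boundary between regions of a line of plain text. The sample line is made up for illustration.

```python
# Invented sample line with three comma-separated fields.
line = "Moby Dick,Herman Melville,1851"

# Split the line wherever the comma appears.
fields = line.split(",")
print(fields)  # ['Moby Dick', 'Herman Melville', '1851']

# The same fields can be joined again with a different separator.
rejoined = " | ".join(fields)
print(rejoined)  # Moby Dick | Herman Melville | 1851
```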
A tree-like structure which represents the organization and hierarchy of files within a directory. Terms such as parent and child are used to describe relationships between files and folders within this system.
Also known as a scatter plot. A graph which uses Cartesian coordinates to display values for multiple variables of a set of data. Particularly useful for displaying positional information for words within a text.
A cloned copy of a project which is set up on an independent branch, separate from the original. Often used as a development tool in open-source software, where anyone can create a fork of the program and work on it as a distinct piece of software. GitHub is an example of a tool which facilitates this sharing and development process.
Put simply, functions provide functionality to a program. They are blocks of organized code which begin with the keyword def, followed by the name of the function you wish to define and a pair of parentheses, which hold any parameters. The header line ends with a colon, and the code block beneath it must be indented.
Empty spaces used as a formatting tool to designate blocks of code in programming. In Python, indentation is used to indicate a block of code, and four spaces are typically used. Each line of code in the block must be indented by the same number of spaces, otherwise an error may occur.
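A minimal sketch of a function definition in Python, showing the def keyword, the function name, the parentheses, the colon and the indented block. The name count_words is hypothetical.

```python
# "def", then the name, then parentheses holding the parameter, then a colon.
def count_words(text):
    # The indented lines form the body of the function.
    return len(text.split())

total = count_words("Call me Ishmael")
print(total)  # 3
```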
The repetition of a procedure in the form of a loop to obtain successively closer approximations to the solution of a problem.
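One way to picture successive approximation is the classic averaging method for square roots, sketched below. The function name approximate_sqrt, the starting guess and the number of steps are assumptions made for this example, and the input is assumed to be a positive number.

```python
# Repeat the same step in a loop; each pass refines the previous guess.
def approximate_sqrt(number, steps=6):
    guess = number / 2  # arbitrary starting guess
    history = []
    for _ in range(steps):
        guess = (guess + number / guess) / 2
        history.append(guess)
    return history

approximations = approximate_sqrt(2.0)
for value in approximations:
    print(value)
```

Each printed value should sit closer to the square root of 2 than the one before it, until the limit of floating-point precision is reached.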
The core computer program of the operating system, which can control all system processes. The IPython kernel runs the code in the background for Jupyter notebooks.
A lemma is the canonical form of a word. Lemmatization is the process of grouping together inflected forms of a word to be analysed as a single item, i.e. determining the original lemma for the words.
A method for defining and constructing lists. Particularly useful for creating a new list from an existing list using an expression with a for / in statement within a set of square brackets.
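A brief sketch of building new lists from an existing one with a for / in expression inside square brackets. The sample words are invented.

```python
words = ["Call", "me", "Ishmael"]

# One new item per existing item.
lengths = [len(word) for word in words]
print(lengths)  # [4, 2, 7]

# An optional "if" keeps only the items that pass a test.
long_words = [word.lower() for word in words if len(word) > 3]
print(long_words)  # ['call', 'ishmael']
```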
Placing objects or elements in a hierarchical arrangement within a collection, such as a list whose items are themselves lists.
A unit of variable size (n = the number of letters, words, etc. it contains) taken from a given sequence of text in a corpus and used in language modelling.
A process of transforming text into a single canonical form, thereby facilitating data consistency for further processing. Examples include removing non-alphanumeric characters or changing to lower case.
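A small sketch of one collection placed inside another, here a list of lists. The variable names and sample words are hypothetical.

```python
# Each inner list holds the words of one sentence.
sentences = [["Call", "me", "Ishmael"], ["Some", "years", "ago"]]

# The first index picks the inner list, the second picks an item within it.
second_word = sentences[0][1]
print(second_word)  # me
```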
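A minimal sketch of extracting units of size n from a sequence of word tokens, with n = 2. The helper name ngrams is hypothetical and whitespace splitting stands in for real tokenization.

```python
# Slide a window of n tokens along the sequence.
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "to be or not to be".split()
bigrams = ngrams(tokens, 2)
print(bigrams)
# [('to', 'be'), ('be', 'or'), ('or', 'not'), ('not', 'to'), ('to', 'be')]
```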
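The two examples mentioned above, lower-casing and removing non-alphanumeric characters, might be sketched as follows. The function name normalize is hypothetical, and the pattern assumes unaccented English text, so accented letters would also be removed.

```python
import re

def normalize(text):
    lowered = text.lower()
    # Drop everything except a-z, 0-9 and whitespace (assumes unaccented text).
    cleaned = re.sub(r"[^a-z0-9\s]", "", lowered)
    # Collapse runs of whitespace into single spaces.
    return " ".join(cleaned.split())

result = normalize("Call me Ishmael!  Some years ago...")
print(result)  # call me ishmael some years ago
```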
Data which has attributes or values AND a defined behaviour.
Symbols which perform arithmetic or logical computation. Some basic types of operators used in Python are arithmetic (addition +, modulus %, etc.), comparison (greater than >, not equal to !=, etc.) and logical (and, or, not).
Parsing or syntactic analysis is a process whereby sentences or strings of words are analysed by a computer into their constituents. This is often represented in a parse tree, which illustrates the syntactic structure.
Text which includes only data related to the readable material, that is, without data related to graphical presentation, formatting or other objects such as images. It is typically encoded using Unicode standards and created in a text editor such as TextEdit on Mac or WordPad on PC. Plain texts are particularly useful for archival storage as they are not confined to proprietary software and can be opened and edited on many systems, thereby ensuring a more universal accessibility and preservation.
A sequence of characters which defines a search pattern. These patterns are useful for performing string operations such as find or find and replace.
A central location where data is stored and managed. More specifically, in revision control systems a repository stores metadata for sets of files or a directory structure.
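The three families of operators named above can be seen in a few lines. The variable names and numbers are arbitrary.

```python
# Arithmetic operators
total = 7 + 3       # addition gives 10
remainder = 7 % 3   # modulus gives 1

# Comparison operators produce Boolean values
is_greater = 7 > 3      # True
is_different = 7 != 3   # True

# Logical operators combine Boolean values
both = is_greater and is_different           # True
neither = not (is_greater or is_different)   # False
```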
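A brief sketch of find and find-and-replace using Python's built-in re module. The sample sentence and the patterns are invented for illustration.

```python
import re

text = "The whale, the white whale!"

# Find every word that starts with "wh".
matches = re.findall(r"\bwh\w+", text)
print(matches)  # ['whale', 'white', 'whale']

# Find and replace: swap each "whale" for "ship".
replaced = re.sub(r"whale", "ship", text)
print(replaced)  # The ship, the white ship!
```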
Also known as a sparse array. It is a matrix (an array of data arranged in a rectangular structure of columns and rows) in which most of the elements are zero. If most of the elements were populated by values other than zero, then the matrix could be considered dense.
The process of reducing a word to its base form or word stem, e.g. added/adding would reduce to add.
A list of words which are programmed to be ignored or filtered in analysis and search queries. Lists of stop-words often contain high-frequency function words such as "the", "of" and "and".
A string is a sequence of characters, such as letters, numbers or symbols.
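Because most elements are zero, such a matrix can be stored compactly by recording only the non-zero values. The sketch below uses a plain dictionary for this. The 3 x 3 sample data is invented, and dedicated libraries offer more efficient formats.

```python
# A small matrix in which most elements are zero.
matrix = [
    [0, 0, 3],
    [0, 0, 0],
    [1, 0, 0],
]

# Keep only the non-zero entries, keyed by (row, column).
sparse = {
    (r, c): value
    for r, row in enumerate(matrix)
    for c, value in enumerate(row)
    if value != 0
}
print(sparse)  # {(0, 2): 3, (2, 0): 1}

# Share of elements that are zero (7 of 9 here).
zero_share = 1 - len(sparse) / 9
```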
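A deliberately naive sketch of the idea, handling only the "ing" and "ed" endings from the example above. The function name naive_stem and the minimum-length rule are assumptions for illustration. Established stemmers such as the Porter algorithm apply many more rules.

```python
# Toy suffix stripper: not a real stemming algorithm.
def naive_stem(word):
    for suffix in ("ing", "ed"):
        # Require at least three letters to remain after stripping.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

stems = [naive_stem(w) for w in ["added", "adding", "add"]]
print(stems)  # ['add', 'add', 'add']
```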
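Filtering might look like the sketch below. The three-word stop list and the sample phrase are invented, and real stop lists are usually much longer.

```python
# A tiny, hypothetical stop list.
stop_words = {"the", "of", "and"}

tokens = "the call of the sea and the sky".split()

# Keep only the tokens that are not on the stop list.
content_words = [token for token in tokens if token not in stop_words]
print(content_words)  # ['call', 'sea', 'sky']
```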
A data set used to train a model in machine learning. The examples in it are used to fit the parameters of the model, and the trained model is then evaluated against a separate testing data set.
An immutable (fixed) sequence of objects. Tuples are created by separating values using commas within a set of parentheses, e.g. (1, 2, 3, 4, 5).
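A short sketch of creating a tuple and of what happens when code tries to change it. The variable names are arbitrary.

```python
numbers = (1, 2, 3, 4, 5)

# Items can be read by index, just as with a list.
first = numbers[0]
print(first)  # 1

# Assigning to an item raises a TypeError because tuples are immutable.
try:
    numbers[0] = 99
except TypeError as error:
    message = str(error)
    print(message)
```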
A variable stores a piece of data and gives it a specific name. Common data types which are stored in variables in Python include numbers and Boolean values.
An industry standard in computing for encoding (representing) text. Letters, numbers and symbols are assigned unique numeric values which facilitate universal application across different programs and platforms. A fun example of the utility of Unicode is the emoji keyboard used on smartphones when sending messages. The universal nature of Unicode allows emoji to be accurately represented on most modern phones regardless of their differing operating systems (such as Android, iOS or BlackBerry).
CC BY-SA From The Art of Literary Text Analysis by Stéfan Sinclair & Geoffrey Rockwell
Edited and revised by Melissa Mony