The National Archives provides a web-based search interface for searching the index catalogues of its various collections.
As well as a simple search box that runs a free-text search over all the record columns (presumably?), we can also run advanced searches that can include reference and date limits.
Search results containing the index records for your search hits can be downloaded as a CSV file.
By searching for records associated with a particular collection tag / reference, we can obtain, and thence download, a copy of the collection's index records.
We can then load these records into our own database and search them using our own search tools, as well as annotating the records using techniques such as named entity recognition.
So let's have a go at that...
Searching for index records associated with HO-40-1 over the period 1810-15 leads us to a search results page with the URL:
https://discovery.nationalarchives.gov.uk/results/r?_cr=HO%2040-1&_dss=range&_sd=1810&_ed=1815&_ro=any&_st=adv
This HTTP GETs the URL https://discovery.nationalarchives.gov.uk/results/r
with arguments:
_cr:'HO 40-1'
_dss:'range'
_sd:1810
_ed:1815
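Equivalently, we can build that query string ourselves from a parameter dict. A minimal sketch using only the standard library; note that `quote_via=quote` is needed to encode the space in the reference as `%20`, matching the URL above (the default, `quote_plus`, would give `HO+40-1`):

```python
from urllib.parse import urlencode, quote

url = 'https://discovery.nationalarchives.gov.uk/results/r'
params = {'_cr': 'HO 40-1', '_dss': 'range', '_sd': 1810, '_ed': 1815}

# quote_via=quote percent-encodes the space as %20 rather than '+'
query = urlencode(params, quote_via=quote)
full_url = f'{url}?{query}'
```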
To download the data records, we then need to click a form button, rather than a web link.
We can automate this procedure by constructing the desired URL with appropriate arguments, ensuring the correct form download options are set, "clicking" the download button, and capturing the response.
# MechanicalSoup combines a simple stateful virtual browser (in the spirit of
# mechanize, built on requests) with a web scraping package (BeautifulSoup)
import mechanicalsoup
import mechanicalsoup
Define the URL of the search results and download page:
url='https://discovery.nationalarchives.gov.uk/results/r'
Specify the search limits around the collection we are interested in:
params = {'_cr':'HO 40-1','_dss':'range','_sd':1810,'_ed':1815}
Open the page with those parameters:
browser = mechanicalsoup.StatefulBrowser()
browser.open(url, params=params)
<Response [200]>
Configure the search form:
browser.select_form('form[action="/search/download"]')
browser["expSize"] = "10"
#browser.get_current_form().print_summary()
"Click" the download button:
response = browser.submit_selected()
Read the response into a pandas dataframe and preview the result, casting date fields into date format:
#StringIO is a function for wrapping a file pointer around a string
from io import StringIO
#Pandas is a package for working with tabular datasets
import pandas as pd
df = pd.read_csv(StringIO(response.text))
#Force the start and end date columns into a date format
df['Start Date'] = pd.to_datetime(df['Start Date'],errors='coerce', dayfirst=True)
df['End Date'] = pd.to_datetime(df['End Date'],errors='coerce', dayfirst=True)
df.head(3)
Citable Reference | Context Description | Title | Description | Start Date | Start Date (num) | End Date | End Date (num) | Covering Dates | Held by | Catalogue level | References | Opening Date | Closure Status | Closure Type | Closure Code | Subjects | Digitised | ID | Score | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | HO 40/1 | Home Office: Disturbances Correspondence. | HO 40. The Luddite riots - reports | HO 40. The Luddite riots - reports. | 1812-01-01 | 18120101 | 1855-12-31 | 18551231 | 1812-1855 | The National Archives, Kew | 6 | NaN | NaN | Open Document, Open Description | Normal Closure before FOI Act: | 30 | C10086 Public disorder | Yes | C3083303 | 0.177554 |
1 | HO 40/1/6 | Home Office: Disturbances Correspondence. HO 4... | Lancashire. Lt. Gen. (copies of (1) above) Mai... | Lancashire. Lt. Gen. (copies of (1) above) Mai... | 1812-05-01 | 18120501 | 1812-06-30 | 18120630 | 1812 May - June | The National Archives, Kew | 7 | \r\nFormer Reference Pro: HO 40/1/(6) | NaN | Open Document, Open Description | Normal Closure before FOI Act: | 30 | C10086 Public disorder | NaN | C6573173 | 0.158834 |
2 | HO 40/1/7 | Home Office: Disturbances Correspondence. HO 4... | Yorkshire magistrates reports (copies of (1) a... | Yorkshire magistrates reports (copies of (1) a... | 1812-03-01 | 18120301 | 1812-05-31 | 18120531 | 1812 Mar. - May | The National Archives, Kew | 7 | \r\nFormer Reference Pro: HO 40/1/(7) | NaN | Open Document, Open Description | Normal Closure before FOI Act: | 30 | C10086 Public disorder | NaN | C6573174 | 0.158834 |
We can build up a larger index by extending our search, or by combining the downloads from multiple searches.
Create a function to do the download of a single index:
def get_index(reference, start=1810, end=1815, typ='ref'):
    """Download the index for a specified reference and convert it to a dataframe."""
    url = 'https://discovery.nationalarchives.gov.uk/results/r'
    params = {'_dss': 'range', '_sd': start, '_ed': end}
    if typ == 'search':
        params['_q'] = reference
    else:
        params['_cr'] = reference
    browser = mechanicalsoup.StatefulBrowser()
    browser.open(url, params=params)
    #No results
    if browser.get_current_page().find("div", {"class": "emphasis-block no-results"}):
        return pd.DataFrame()
    browser.select_form('form[action="/search/download"]')
    browser["expSize"] = "10"
    response = browser.submit_selected()
    _df = pd.read_csv(StringIO(response.text))
    #Force the start and end date columns into a date format
    _df['Start Date'] = pd.to_datetime(_df['Start Date'], errors='coerce', dayfirst=True)
    _df['End Date'] = pd.to_datetime(_df['End Date'], errors='coerce', dayfirst=True)
    return _df
get_index('HO 42').head(3)
Citable Reference | Context Description | Title | Description | Start Date | Start Date (num) | End Date | End Date (num) | Covering Dates | Held by | Catalogue level | References | Opening Date | Closure Status | Closure Type | Closure Code | Subjects | Digitised | ID | Score | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | HO 42 | Home Office: Domestic Correspondence, George III. | Home Office: Domestic Correspondence, George III | Original Home Office domestic letters. PLEASE ... | 1782-01-01 | 17820101 | 1820-12-31 | 18201231 | 1782-1820 | The National Archives, Kew | 3 | NaN | NaN | NaN | Normal Closure before FOI Act: | 30 | NaN | NaN | C8906 | 0.032754 |
1 | HO 42/108 | Home Office: Domestic Correspondence, George III. | HO 42. Letters and Papers. Supplementary. | HO 42. Letters and Papers. Supplementary. | 1810-07-01 | 18100701 | 1810-10-31 | 18101031 | 1810 July 01-1810 Oct 31 | The National Archives, Kew | 6 | NaN | NaN | Open Document, Open Description | Normal Closure before FOI Act: | 30 | NaN | Yes | C1905727 | 0.026892 |
2 | HO 42/111 | Home Office: Domestic Correspondence, George III. | HO 42. Letters and Papers | HO 42. Letters and Papers. | 1811-04-01 | 18110401 | 1811-06-30 | 18110630 | 1811 Apr 01-1811 June 30 | The National Archives, Kew | 6 | NaN | NaN | Open Document, Open Description | Normal Closure before FOI Act: | 30 | NaN | Yes | C1905730 | 0.026892 |
Note that some searches seem to be quite wide-ranging against particular codes (rather than lookups by reference), and some responses also appear to contain transcripts in the Description field.
#Pull out the first 500 characters of records longer than 2,000 characters
[r[:500] for r in get_index('HO 42',typ='search')['Description'].to_list() if len(r)>2000]
['Report of Soulden Lawrence on 16 individual petitions (13 from the prisoner; H Neale, officer of marines; Mr Castle, Clerk of the Crown for Durham and A Graham) and 4 collective petitions (34 members of the corporation of Durham; 2 others (31 and 34 people) with similar signatories and 3 people, the prisoner and others of London) on behalf of John Davison, late a captain in the Royal Marines, convicted at the Somerset Assizes held in Taunton in August 1809, for the theft of 6 yards of muslin, va', 'General registers, early warrant and entry books and other records covering the multifarious subjects for which the Home Office has had responsibility; also records of subjects which do not fit into other divisional categories. Broadly, the subjects and their series in this division are as follows: Addresses, HO 55, HO 57, HO 249 Admiralty, HO 28, HO 29 Advertisements, HO 174 Animals and wild birds, HO 183, HO 285 Automatic data processing, HO 337 Betting, gaming and lotteries, HO 320 Bouillon p', "Board and Committee minutes HO.RVI/1/1-7 Board of Governors minutes, Court of Governors before 1948. 1911 - 1971 (7 volumes) See also HO.RVI/47 for Joint Minutes with Regional Hospital Board. HO.RVI/2/1-61 House Committee minutes, 1751 - 1971 (61 volumes) Volumes 1-4, 6-11, 17-18 contain patients' admissiosn and discharges. HO.RVI/151/1-6 Rough House Committee minutes, 1752 - 1755 (6 volumes) HO.RVI/3/1-2 Anaesthetic Committee minutes, 1924 - 1948 (2 volumes) HO.RVI/4/1-2 Appeal Committee minute", 'Administration HO.PM/1/1-18 Minutes 1760 - 1945 From 1760 to 1822 weekly court minutes, to 1900 also House Committee minutes, from 1900 also Finance Committee and Management Committee minutes. 
(18 volumes, 27 papers) HO.PM/2 Charity for the Relief of Poor Women Lying-in at Their Own Homes, minutes 1787 - 1858 Lying-in hospital House Committee minutes, 1859 (1 volume) HO.PM/3/1-3 Medical Staff meetings minutes, 1917 - 1951 (3 volumes) HO.PM/45 Honorary medical staff meetings minutes, 1935 - 1949 ', "Report of Alexander Thomson on 1 individual petition (the prisoner [detailed, gives information concerning family and business]) on behalf of Peter Degraves, merchant of London and Manchester, Lancashire, tried at the 'last' Lancaster Assizes held in 1810 and convicted of stealing a large quantity of goods including French cambrics, value between £2,000-3,000, property of John Parson, merchant of Manchester, from the warehouse of Thomas Benbridge/Thomas Bainbridge. Evidences supplied by John Pa", '1707-1812 Watchet Harbour, copies of Acts, 1707-08, 1720-21. 1770. 1809 printed and ms.) with petitions etc.; copies of Minehead Harbour Act 1711; accounts and estimates for repair c.1720 with undated agreement to build a quay at Watchet by Wm. Rowe of Bridgwater, mason, 1708; petitions re. need for improvements, 1811 with correspondence, 1812. 1 bundle 1707-1809 ms copies of Watchet Harbour Acts (as above). 
1 volume 1772-1808 Watchet Quay maintenance accounts 1772-1808 (1 volume) 1782-1808 (2 c', "HIL/1 Records of firm and Hillman family HIL/2 Co-partnership agreements HIL/3 Apprenticeship indentures HIL/4 Assignments of debts HIL/5 Wills and executorship papers HIL/6 Title deeds of clients' properties HIL/6/1-14 Lewes: St Thomas at Cliffe HIL/6/15-29 Lewes: other parishes HIL/6/30-32 Alfriston HIL/6/33 Arlington HIL/6/34-36 Barcombe HIL/6/37 Bishopstone HIL/6/38-40 Brighton HIL/6/41 Ditchling HIL/6/42-48 Eastbourne HIL/6/49 Framfield HIL/6/50 Friston HIL/6/51-53 Hailsham HIL/6/54 Helling", 'SUMMARY OF CONTENTS L/C Lieutenancy - County L/C/D Deputy Lieutenancy L/C/C County - commissions L/C/C/1 Original commissions; 1853-1861 L/C/C/2 Letters of royal approval; 1835-1913 L/C/C/3 Correspondence and papers; 1778-1915 L/C/C/4 Draft commissions and precedents; 1804-1870 L/C/C/5 Lists of Deputy Lieutenants; 1807-1852 L/C/G General meetings L/C/M Militia L/C/M/1 Lists of men enrolled; 1803-1855 L/C/M/2 Subdivisional returns; 1806-1831 L/C/M/3 Regimental returns; 1804-1874 L/C/M/4 Correspon']
We can now use that function to download and combine indexes for multiple references:
search_references = ['HO 40-1', 'HO 40-2', 'HO 43-19', 'HO 43-20', 'HO 43-21', 'HO 42-110']
df_combined = pd.DataFrame()
for reference in search_references:
    _df = get_index(reference)
    print(f'{reference}: {len(_df)}')
    #DataFrame.append has been removed from recent pandas versions, so use pd.concat
    df_combined = pd.concat([df_combined, _df])
df_combined = df_combined.sort_values('Citable Reference').reset_index(drop=True)
df_combined.head()
HO 40-1: 9
HO 40-2: 10
HO 43-19: 1
HO 43-20: 1
HO 43-21: 1
HO 42-110: 1
Citable Reference | Context Description | Title | Description | Start Date | Start Date (num) | End Date | End Date (num) | Covering Dates | Held by | Catalogue level | References | Opening Date | Closure Status | Closure Type | Closure Code | Subjects | Digitised | ID | Score | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | HO 40/1 | Home Office: Disturbances Correspondence. | HO 40. The Luddite riots - reports | HO 40. The Luddite riots - reports. | 1812-01-01 | 18120101 | 1855-12-31 | 18551231 | 1812-1855 | The National Archives, Kew | 6 | NaN | NaN | Open Document, Open Description | Normal Closure before FOI Act: | 30 | C10086 Public disorder | Yes | C3083303 | 0.177578 |
1 | HO 40/1/1 | Home Office: Disturbances Correspondence. HO 4... | Cheshire, Lancashire, Yorkshire ff 1-173 ff 17... | Cheshire, Lancashire, Yorkshire ff 1-173 ff 17... | 1812-03-01 | 18120301 | 1812-06-30 | 18120630 | 1812 Mar. - June | The National Archives, Kew | 7 | \r\nFormer Reference Pro: HO 40/1/(1) | NaN | Open Document, Open Description | Normal Closure before FOI Act: | 30 | C10086 Public disorder | NaN | C6573168 | 0.152372 |
2 | HO 40/1/2 | Home Office: Disturbances Correspondence. HO 4... | Cheshire magistrates reports (copies of (1) ab... | Cheshire magistrates reports (copies of (1) ab... | 1812-03-01 | 18120301 | 1812-06-30 | 18120630 | 1812 Mar. - June | The National Archives, Kew | 7 | \r\nFormer Reference Pro: HO 40/1/(2) | NaN | Open Document, Open Description | Normal Closure before FOI Act: | 30 | C10086 Public disorder | NaN | C6573169 | 0.152372 |
3 | HO 40/1/3 | Home Office: Disturbances Correspondence. HO 4... | Lancashire magistrates reports (copies of (1) ... | Lancashire magistrates reports (copies of (1) ... | 1812-03-01 | 18120301 | 1812-05-31 | 18120531 | 1812 Mar. - May | The National Archives, Kew | 7 | \r\nFormer Reference Pro: HO 40/1/(3) | NaN | Open Document, Open Description | Normal Closure before FOI Act: | 30 | C10086 Public disorder | NaN | C6573170 | 0.156902 |
4 | HO 40/1/4 | Home Office: Disturbances Correspondence. HO 4... | Lancashire magistrates reports (copies of (1) ... | Lancashire magistrates reports (copies of (1) ... | 1812-03-01 | 18120301 | 1812-06-30 | 18120630 | 1812 Mar. - June | The National Archives, Kew | 7 | \r\nFormer Reference Pro: HO 40/1/(4) | NaN | Open Document, Open Description | Normal Closure before FOI Act: | 30 | C10086 Public disorder | NaN | C6573171 | 0.157119 |
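Since overlapping searches can return the same record in more than one download, it may also be worth deduplicating the combined frame on the record ID column. A minimal sketch on made-up rows in the same shape (the values here are illustrative, not live data):

```python
import pandas as pd

# Two downloads that both returned record C3083303
df_a = pd.DataFrame({'ID': ['C3083303', 'C6573168'],
                     'Citable Reference': ['HO 40/1', 'HO 40/1/1']})
df_b = pd.DataFrame({'ID': ['C3083303'],
                     'Citable Reference': ['HO 40/1']})

df_combined = pd.concat([df_a, df_b])
# Keep only the first occurrence of each record ID
df_deduped = df_combined.drop_duplicates(subset='ID').reset_index(drop=True)
```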
We can get a better view over the descriptions:
df_combined['Description'].to_list()
['HO 40. The Luddite riots - reports.', 'Cheshire, Lancashire, Yorkshire ff 1-173 ff 174-283.', 'Cheshire magistrates reports (copies of (1) above) ff 284-341.', 'Lancashire magistrates reports (copies of (1) above) ff 342-371.', 'Lancashire magistrates reports (copies of (1) above) ff 372-471.', 'Enclosures to a letter dated (copies of (1) above) 16 May, 1812 in (4) above ff 472-485.', "Lancashire. Lt. Gen. (copies of (1) above) Maitland's reports ff 486-540.", 'Yorkshire magistrates reports (copies of (1) above) ff 541-596.', 'Yorkshire Sir Francis Lindley (copies of (1) above) Wood, Vice Lt. West Riding; reports ff 597-624.', 'HO 40. The Luddite riots - military reports.', 'Cheshire ff 1a - 115.', 'Lancashire ff 116-253.', 'Yorkshire ff 254-399 ff 400-562.', 'Chelmsford, London and miscellaneous ff 563-646.', 'Notebook containing names of known and suspected Luddites.', 'Notebook containing various payments to constables, etc.', 'Copies of letters addressed to Lt. Gen. Maitland.', 'Copies of letters addressed to Lt. Gen. Maitland.', 'Copies of letters addressed to Lt. Gen. Maitland.', 'HO 42. Letters and Papers.', 'Domestic Letter Book.', 'Domestic Letter Book.', 'Domestic Letter Book.']
The title field appears to be a subset of the description field (up to the first N characters).
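We can spot-check that claim against values visible in the first row of the downloaded index; a row-wise check over a full dataframe is sketched in the comment, but is untested against the live data:

```python
# Values from the first row of the downloaded index
title = "HO 40. The Luddite riots - reports"
description = "HO 40. The Luddite riots - reports."

is_prefix = description.startswith(title)

# Across a whole dataframe df, a row-wise version of the same check might be:
# df.apply(lambda r: str(r['Description']).startswith(str(r['Title'])), axis=1)
```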
We can parse named entities out of the description field to make searching the records easier.
The spacy natural language processing (NLP) package provides a named entity tagger that is good enough to get us started.
import spacy
#Install the package that provides the named entity model
#!python -m spacy download en_core_web_sm
Here's an example of running the named entity tagger:
nlp = spacy.load("en_core_web_sm")
TEST_STRING = "Joseph Radcliffe, wrote a letter to the Home Office on March 5th, 1812 about the Luddites."
doc = nlp(TEST_STRING)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
Joseph Radcliffe 0 16 PERSON
the Home Office 36 51 ORG
March 5th, 1812 55 70 DATE
Luddites 81 89 GPE
GPE is a "geo-political entity". There is also a related NORP: "nationalities or religious or political groups". The numbers are character offsets into the original string: the index of the first character of the extracted string, and the index one past its last character.
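Because the offsets follow Python's slicing convention (start inclusive, end exclusive), slicing the original string with them recovers the entity text exactly:

```python
TEST_STRING = "Joseph Radcliffe, wrote a letter to the Home Office on March 5th, 1812 about the Luddites."

# Offsets as reported by the tagger above
person = TEST_STRING[0:16]
org = TEST_STRING[36:51]
```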
We can create a simple function to pull out the elements we want, returning a list of all elements extracted from a block of text.
def entity_rec(txt):
    """Extract entities from a text and return a list of (entity text, entity type) tuples."""
    doc = nlp(txt)
    ents = []
    for ent in doc.ents:
        #ents.append((ent.text, ent.start_char, ent.end_char, ent.label_))
        #Exclude certain entity types from the returned list
        if ent.label_ not in ['CARDINAL']:
            ents.append((ent.text, ent.label_))
    return ents
We can apply this function to the Description text associated with each row:
df['Entities'] = df['Description'].apply(lambda x: entity_rec(x))
df.head(3)
Citable Reference | Context Description | Title | Description | Start Date | Start Date (num) | End Date | End Date (num) | Covering Dates | Held by | ... | References | Opening Date | Closure Status | Closure Type | Closure Code | Subjects | Digitised | ID | Score | Entities | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | HO 40/1 | Home Office: Disturbances Correspondence. | HO 40. The Luddite riots - reports | HO 40. The Luddite riots - reports. | 1812-01-01 | 18120101 | 1855-12-31 | 18551231 | 1812-1855 | The National Archives, Kew | ... | NaN | NaN | Open Document, Open Description | Normal Closure before FOI Act: | 30 | C10086 Public disorder | Yes | C3083303 | 0.177554 | [(Luddite, NORP)] |
1 | HO 40/1/6 | Home Office: Disturbances Correspondence. HO 4... | Lancashire. Lt. Gen. (copies of (1) above) Mai... | Lancashire. Lt. Gen. (copies of (1) above) Mai... | 1812-05-01 | 18120501 | 1812-06-30 | 18120630 | 1812 May - June | The National Archives, Kew | ... | \r\nFormer Reference Pro: HO 40/1/(6) | NaN | Open Document, Open Description | Normal Closure before FOI Act: | 30 | C10086 Public disorder | NaN | C6573173 | 0.158834 | [(Lancashire, ORG), (Maitland, GPE)] |
2 | HO 40/1/7 | Home Office: Disturbances Correspondence. HO 4... | Yorkshire magistrates reports (copies of (1) a... | Yorkshire magistrates reports (copies of (1) a... | 1812-03-01 | 18120301 | 1812-05-31 | 18120531 | 1812 Mar. - May | The National Archives, Kew | ... | \r\nFormer Reference Pro: HO 40/1/(7) | NaN | Open Document, Open Description | Normal Closure before FOI Act: | 30 | C10086 Public disorder | NaN | C6573174 | 0.158834 | [(Yorkshire, PERSON)] |
3 rows × 21 columns
We can then generate a long-format data frame that associates each entity tuple with each record, as identified by the record ID:
df_entities = df.explode('Entities').reset_index(drop=True)[['ID','Entities']]
df_entities.head(3)
ID | Entities | |
---|---|---|
0 | C3083303 | (Luddite, NORP) |
1 | C6573173 | (Lancashire, ORG) |
2 | C6573173 | (Maitland, GPE) |
We can then split out the entity tuple elements into separate columns, noting that the entity type recognition, as well as the entity extraction itself, may be a bit ropey:
df_entities[['Entity','Type']] = df_entities['Entities'].apply(pd.Series)
df_entities.drop(columns='Entities', inplace=True)
df_entities.head(10)
ID | Entity | Type | |
---|---|---|---|
0 | C3083303 | Luddite | NORP |
1 | C6573173 | Lancashire | ORG |
2 | C6573173 | Maitland | GPE |
3 | C6573174 | Yorkshire | PERSON |
4 | C6573175 | Yorkshire Sir | PERSON |
5 | C6573175 | Francis Lindley | PERSON |
6 | C6573175 | West Riding | GPE |
7 | C6573171 | Lancashire | ORG |
8 | C6573170 | Lancashire | ORG |
9 | C6573168 | Cheshire | ORG |
If we wanted to work on this a bit more, it would be handy to be able to recognise English county and place names as such. We could also try to munge any DATE elements through a robust date parser in order to get the dates into actual date objects.
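For example, pandas (which delegates free-form strings to the dateutil parser) copes with the date string extracted in the earlier example, including the ordinal day suffix:

```python
import pandas as pd

# "March 5th, 1812" is the DATE entity extracted from the test string above
d = pd.to_datetime("March 5th, 1812")
```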
Other useful bits of information are the folio / page numbers.
import re
TEST_STRING_2 = "Cheshire, Lancashire, Yorkshire ff 1-173 ff 174-283."
FF_PATTERN = r"ff \d+-\d+"
m = re.findall(FF_PATTERN, TEST_STRING_2, re.MULTILINE)
m
['ff 1-173', 'ff 174-283']
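A variant of the pattern with capture groups pulls the start and end folio numbers out directly as integers, rather than leaving them embedded in an `ff ...` string:

```python
import re

TEST_STRING_2 = "Cheshire, Lancashire, Yorkshire ff 1-173 ff 174-283."

# Capture groups around each run of digits
FF_GROUPED = r"ff (\d+)-(\d+)"
pairs = [(int(a), int(b)) for a, b in re.findall(FF_GROUPED, TEST_STRING_2)]
```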
Again, we can capture these into a long dataframe:
df['Pages'] = df['Description'].apply(lambda x: re.findall(FF_PATTERN, x, re.MULTILINE))
df[['Description','Pages']].head(10)
Description | Pages | |
---|---|---|
0 | HO 40. The Luddite riots - reports. | [] |
1 | Lancashire. Lt. Gen. (copies of (1) above) Mai... | [ff 486-540] |
2 | Yorkshire magistrates reports (copies of (1) a... | [ff 541-596] |
3 | Yorkshire Sir Francis Lindley (copies of (1) a... | [ff 597-624] |
4 | Lancashire magistrates reports (copies of (1) ... | [ff 372-471] |
5 | Lancashire magistrates reports (copies of (1) ... | [ff 342-371] |
6 | Cheshire, Lancashire, Yorkshire ff 1-173 ff 17... | [ff 1-173, ff 174-283] |
7 | Cheshire magistrates reports (copies of (1) ab... | [ff 284-341] |
8 | Enclosures to a letter dated (copies of (1) ab... | [ff 472-485] |
We can make the table longer by exploding multiple page references for any given record, and then also splitting out the first and last page reference:
df_pages = df.explode('Pages').reset_index(drop=True)[['ID','Pages']].dropna()
df_pages[['Start', 'End']] = df_pages['Pages'].str.replace('ff','').str.strip().str.split('-').apply(pd.Series)
df_pages.sort_values(['ID','Start'], inplace=True)
df_pages.reset_index(drop=True, inplace=True)
df_pages.head(10)
ID | Pages | Start | End | |
---|---|---|---|---|
0 | C6573168 | ff 1-173 | 1 | 173 |
1 | C6573168 | ff 174-283 | 174 | 283 |
2 | C6573169 | ff 284-341 | 284 | 341 |
3 | C6573170 | ff 342-371 | 342 | 371 |
4 | C6573171 | ff 372-471 | 372 | 471 |
5 | C6573172 | ff 472-485 | 472 | 485 |
6 | C6573173 | ff 486-540 | 486 | 540 |
7 | C6573174 | ff 541-596 | 541 | 596 |
8 | C6573175 | ff 597-624 | 597 | 624 |
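Note that the Start and End columns are still strings at this point, so sorting on them is lexical rather than numeric. A sketch, on a couple of sample rows in the shape of df_pages, of casting them to integers and deriving a folio count per range:

```python
import pandas as pd

# Sample rows in the shape of df_pages; Start/End are strings after the split
df_pages = pd.DataFrame({'ID': ['C6573168', 'C6573168'],
                         'Pages': ['ff 1-173', 'ff 174-283'],
                         'Start': ['1', '174'],
                         'End': ['173', '283']})

df_pages[['Start', 'End']] = df_pages[['Start', 'End']].astype(int)
# Folio ranges are inclusive at both ends
df_pages['Folios'] = df_pages['End'] - df_pages['Start'] + 1
```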
When downloading a scanned collection from the National Archives, the scan associated with a reference (for example, HO 40/1) may be split into several separate PDF documents.
We can merge these into a single document, which makes working with the collection slightly easier from a programmatic point of view, albeit at the cost of slightly heavier memory requirements...
The following cell finds the filenames of all the PDFs I downloaded as part of the HO-40-1 download and sorts them.
from os import listdir
reference = 'HO-40-1'
pdfs = [f'../HO - Home Office/{f}' for f in listdir('../HO - Home Office') if f.startswith(reference)]
pdfs.sort()
pdfs[:3]
['../HO - Home Office/HO-40-1_01.pdf', '../HO - Home Office/HO-40-1_02.pdf', '../HO - Home Office/HO-40-1_03.pdf']
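The plain sort works here because the filename suffixes are zero-padded (`_01`, `_02`, ...). If they weren't, `_10` would sort before `_2` lexically, and a natural sort key would be needed; a sketch on hypothetical unpadded filenames:

```python
import re

def natural_key(s):
    """Split a filename into text and integer chunks for natural-order sorting."""
    return [int(t) if t.isdigit() else t for t in re.split(r'(\d+)', s)]

# Hypothetical unpadded filenames that a plain sort would misorder
files = ['HO-40-1_10.pdf', 'HO-40-1_2.pdf', 'HO-40-1_1.pdf']
ordered = sorted(files, key=natural_key)
```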
We can then merge all these separate PDFs into a single PDF and save it as a new file:
#Note: in recent PyPDF2 versions, PdfFileMerger has been renamed PdfMerger
from PyPDF2 import PdfFileMerger
merger = PdfFileMerger()
for pdf in pdfs:
    merger.append(pdf)
#Save the merged PDF
merger.write(f"{reference}_result.pdf")
merger.close()
We can view a specified page within the merged PDF as an image, converted from the PDF using ImageMagick.
page_num = 500
#The wand package provides a Python API for the Imagemagick application
#!pip3 install --user Wand
from wand.image import Image as WImage
print(f'Displaying at PDF page {page_num}.')
WImage(filename=f'{reference}_result.pdf[{page_num}]',resolution=200)
Displaying at PDF page 500.