In an effort to document techniques to deal with incomplete sequences of integers, this notebook shows how to obtain a list of continuous ranges of integers from a single sorted list, e.g. from this:
[1, 2, 3, 7, 8, 9, 10, 23, 24, 25, 27, 28, 35]
to this:
[1, 3], [7, 10], [23, 25], [27, 28], [35]
The resulting list is much more human readable and can be used to describe the corresponding values associated with each range. Isolated values will remain as such. This technique is integrated in https://github.com/steko/missing with other similar tools to list duplicates, missing values and such.
First of all let's import some data to work with:
nctn = open("nctn.txt").readlines()
nums = [int(n.strip()) for n in nctn]
nums = sorted(nums)
Take a look at our data:
print(len(nums))
1527
print(nums[:20])
[2238, 3319, 3324, 3328, 3329, 3330, 3363, 3372, 3376, 3378, 3380, 3382, 3386, 3391, 3392, 3410, 3412, 3413, 3416, 3423]
As you can see there are a few isolated numbers and then some ranges of two numbers. More long-spanning ranges are further down the list.
Define a function that will find the ranges for us:
def ranges(nums):
'''Return a list of continuous ranges (min, max values) from a list of integers.'''
rgs = [[nums[0]]]
# these two make the code more readable and, incidentally, faster
FIRST, LAST = 0, -1
for n in nums:
if rgs[LAST][LAST] == n - 1:
if rgs[LAST][FIRST] == n -1:
rgs[LAST].append(n)
if rgs[LAST][FIRST] < n - 1:
rgs[LAST][LAST] = n
elif rgs[LAST][LAST] < n -1:
rgs.append([n])
return rgs
And now let's see what ranges we have in our data:
rgs = ranges(nums)
for r in rgs:
try:
print('{} - {}: {} numbers'.format(r[0], r[1], r[1] - r[0] + 1))
except IndexError:
print('{}: 1 number'.format(r[0]))
2238: 1 number 3319: 1 number 3324: 1 number 3328 - 3330: 3 numbers 3363: 1 number 3372: 1 number 3376: 1 number 3378: 1 number 3380: 1 number 3382: 1 number 3386: 1 number 3391 - 3392: 2 numbers 3410: 1 number 3412 - 3413: 2 numbers 3416: 1 number 3423: 1 number 3429: 1 number 3462: 1 number 3483: 1 number 3599 - 3602: 4 numbers 3675: 1 number 3689: 1 number 3768: 1 number 3776: 1 number 3786: 1 number 11865: 1 number 11867 - 11964: 98 numbers 13028 - 13087: 60 numbers 23032 - 23074: 43 numbers 47435 - 47443: 9 numbers 55644 - 55651: 8 numbers 55687: 1 number 55737: 1 number 55750 - 55775: 26 numbers 55777 - 55805: 29 numbers 55807 - 55878: 72 numbers 55880 - 55922: 43 numbers 55924 - 55945: 22 numbers 88038: 1 number 88050 - 88053: 4 numbers 88429 - 88826: 398 numbers 96146: 1 number 96149 - 96271: 123 numbers 96273 - 96275: 3 numbers 97763 - 97870: 108 numbers 98033: 1 number 104878 - 104899: 22 numbers 105396 - 105489: 94 numbers 105500: 1 number 105540 - 105569: 30 numbers 105571 - 105605: 35 numbers 105785 - 105821: 37 numbers 105823 - 105854: 32 numbers 105857 - 105926: 70 numbers 105928 - 105938: 11 numbers 105941 - 105945: 5 numbers 106704 - 106753: 50 numbers 107545 - 107561: 17 numbers 1004632 - 1004656: 25 numbers
The dataset we have been working with is rather small, but if you need to crunch big data a fast tool is really needed. Let's see how the ranges
function performs:
%timeit rgs = ranges(nums)
1000 loops, best of 3: 806 µs per loop
Not bad for a naive algorithm like that.