Given we have a Solr index with many resources indexed by:

Can we automatically generate initial format identification signatures for formats in the tail of the format distribution, with any degree of confidence?

Inspecting a few by hand looks promising. Most cases seem to fit one of these profiles:

Direct FFB-extension mapping, with few or no other FFBs matching with any significance:
- Could formats already matched elsewhere be removed first?
Multiple FFBs per extension:
- Apparent families of similar byte mappings, perhaps sharing an F2B, but nothing else.
- Messes, with very varied ranges of formats.
Multiple extensions per FFB:
- Containers
- Formats with no in-band format signalling
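
The profiles above could be assigned mechanically from extension/FFB co-occurrence counts. The following is only a rough sketch under several assumptions: that FFB and F2B mean the first four and first two bytes of a resource, that FFBs are stored as hex strings, and that a 90% share is a sensible cut-off for "dominant". The function name, the profile labels and the sample counts are all hypothetical.

```python
from collections import Counter, defaultdict


def classify_extensions(pair_counts, min_share=0.9):
    """Assign each extension a rough profile from (extension, ffb) counts.

    pair_counts maps (extension, ffb_hex) -> number of resources.
    min_share is an assumed threshold, not a tuned value.
    """
    by_ext = defaultdict(Counter)
    by_ffb = defaultdict(Counter)
    for (ext, ffb), n in pair_counts.items():
        by_ext[ext][ffb] += n
        by_ffb[ffb][ext] += n

    def dominant(counter):
        # Most common key and the share of the total that it accounts for.
        key, n = counter.most_common(1)[0]
        return key, n / sum(counter.values())

    profiles = {}
    for ext, ffbs in by_ext.items():
        top_ffb, share = dominant(ffbs)
        if share >= min_share:
            # One FFB dominates; check whether it points back to this extension.
            back_ext, back_share = dominant(by_ffb[top_ffb])
            if back_ext == ext and back_share >= min_share:
                profiles[ext] = "direct"
            else:
                profiles[ext] = "shared-ffb"  # multiple extensions per FFB
            continue
        # Several FFBs: do they at least share their first two bytes (4 hex chars)?
        f2b = Counter()
        for ffb, n in ffbs.items():
            f2b[ffb[:4]] += n
        _, f2b_share = dominant(f2b)
        profiles[ext] = "family" if f2b_share >= min_share else "mess"
    return profiles


# Made-up counts, purely for illustration.
example_counts = {
    ("png", "89504e47"): 100,
    ("docx", "504b0304"): 50,
    ("xlsx", "504b0304"): 40,
    ("zip", "504b0304"): 60,
    ("mp3", "fffb9064"): 30,
    ("mp3", "fffb9000"): 30,
    ("mp3", "fffbe044"): 30,
    ("txt", "54686520"): 5,
    ("txt", "0d0a0d0a"): 5,
    ("txt", "3c68746d"): 5,
}
example_profiles = classify_extensions(example_counts)
```

Only the "direct" group looks like an immediate source of candidate signatures; the other labels mainly flag cases that need a closer look by hand.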

The story needs to start with how much of the index is identified only as application/octet-stream, and then develop heuristics for pulling out likely format signatures.
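
As a starting point, the size of the application/octet-stream class and its extension/FFB breakdown could come from a single faceted Solr query. The sketch below only builds the request parameters and does not contact a server. The field names `content_type`, `content_type_ext` and `content_ffb` are assumptions about the index schema and would need to be replaced with the real ones.

```python
from urllib.parse import urlencode


def octet_stream_pivot_params(type_field="content_type",
                              ext_field="content_type_ext",
                              ffb_field="content_ffb",
                              limit=100):
    """Build Solr select parameters for an extension-by-FFB pivot facet.

    All field names are hypothetical defaults, not taken from a real schema.
    """
    return {
        "q": "*:*",
        # Restrict to resources the existing tools could not identify.
        "fq": f'{type_field}:"application/octet-stream"',
        # No documents needed: numFound gives the size of the class.
        "rows": 0,
        "wt": "json",
        "facet": "true",
        # Nested counts: for each extension, the FFBs seen with it.
        "facet.pivot": f"{ext_field},{ffb_field}",
        "facet.limit": limit,
        "facet.mincount": 1,
    }


example_params = octet_stream_pivot_params()
example_query_string = urlencode(example_params)
```

The query string would be appended to the collection's `/select` endpoint. The pivot counts in the response could then be flattened into (extension, FFB) pairs for further analysis.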

Could also trial it against everything (i.e. ignore DROID/Tika results) and see how well it compares with them.
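
One way such a trial might be scored is to check, for each candidate FFB signature, how consistently the resources matching it were given a single label by the existing tools. This is a minimal sketch under that assumption; the record layout, function name and sample data are hypothetical, and the tool label could be either a DROID or a Tika result.

```python
from collections import Counter, defaultdict


def signature_agreement(records, signatures):
    """Report the most common tool label per candidate signature.

    records is an iterable of (ffb_hex, tool_label) pairs;
    signatures is the set of candidate FFBs to evaluate.
    Returns {ffb: (most_common_label, share_of_matches)}.
    """
    labels = defaultdict(Counter)
    for ffb, tool_label in records:
        if ffb in signatures:
            labels[ffb][tool_label] += 1

    report = {}
    for ffb, counts in labels.items():
        label, n = counts.most_common(1)[0]
        report[ffb] = (label, n / sum(counts.values()))
    return report


# Made-up records, purely for illustration.
example_records = (
    [("89504e47", "image/png")] * 9
    + [("89504e47", "application/octet-stream")]
    + [("504b0304", "application/zip")] * 2
    + [("504b0304", "application/epub+zip")] * 2
)
example_report = signature_agreement(example_records, {"89504e47", "504b0304"})
```

A high share suggests the candidate signature agrees with the existing tools. A low share may point to a container, or to a signature that is too short to be discriminating.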

In [1]:

print("HELO WORLD")

HELO WORLD
