Given we have a Solr index with many resources indexed by:
Can we automatically generate initial format identification signatures in the tail of the format distribution, with any degree of confidence?
Inspecting a few by hand looks promising. Most cases seem to fit one of these profiles:
The story needs to start with the size of the application/octet-stream bucket, i.e. how many resources have no more specific identification, and then generate heuristics for pulling likely format signatures out of them.
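One possible heuristic, sketched below under stated assumptions: take the leading bytes of each unidentified resource, group them by a short fixed prefix, and extend each well-populated group to its longest common prefix. The function names, `prefix_len`, and `min_count` are hypothetical choices for illustration, and the sketch assumes the leading bytes are already available as Python `bytes` rather than fetched from Solr.

```python
def common_prefix(samples):
    """Return the longest byte prefix shared by every sample."""
    if not samples:
        return b""
    shortest = min(samples, key=len)
    for i, byte in enumerate(shortest):
        if any(s[i] != byte for s in samples):
            return shortest[:i]
    return shortest


def candidate_signatures(headers, prefix_len=4, min_count=2):
    """Propose (signature, count) pairs from the leading bytes of many files.

    headers    -- iterable of bytes objects (the first bytes of each resource)
    prefix_len -- hypothetical grouping key length, in bytes
    min_count  -- hypothetical minimum group size worth reporting
    """
    groups = {}
    for header in headers:
        if len(header) >= prefix_len:
            groups.setdefault(header[:prefix_len], []).append(header)

    candidates = []
    for members in groups.values():
        if len(members) >= min_count:
            # Extend the grouping key to everything the group agrees on.
            candidates.append((common_prefix(members), len(members)))

    # Most frequent candidates first.
    return sorted(candidates, key=lambda c: c[1], reverse=True)


if __name__ == "__main__":
    sample_headers = [
        b"%PDF-1.4\n%abc",
        b"%PDF-1.4\n%xyz",
        b"%PDF-1.5\nfoo",
        b"GIF89a\x01",
        b"GIF89a\x02",
        b"random",
    ]
    for signature, count in candidate_signatures(sample_headers):
        print(count, signature)
```

A fixed-offset prefix match like this would only catch magic numbers at the start of a file; formats identified by trailers or by offsets deeper into the file would need a different heuristic.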
We could also trial it against everything (i.e. ignore the DROID/Tika results) and see how well the generated signatures compare with them.
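As a rough sketch of what that comparison might look like, the snippet below computes a simple agreement rate between two parallel lists of labels and tallies where they differ. The function name and the idea of representing each tool's output as a list of MIME-type strings are assumptions for illustration, not how DROID or Tika actually expose their results.

```python
from collections import Counter


def agreement(generated, reference):
    """Compare two parallel lists of format labels.

    Returns (rate, disagreements), where rate is the fraction of resources
    given the same label by both, and disagreements counts each
    (reference_label, generated_label) pair that differed.
    """
    if len(generated) != len(reference):
        raise ValueError("label lists must be the same length")

    matches = sum(g == r for g, r in zip(generated, reference))
    disagreements = Counter(
        (r, g) for g, r in zip(generated, reference) if g != r
    )
    rate = matches / len(generated) if generated else 0.0
    return rate, disagreements


if __name__ == "__main__":
    ours = ["application/pdf", "image/gif", "image/png", "application/pdf"]
    theirs = ["application/pdf", "image/gif", "image/gif", "application/octet-stream"]
    rate, diffs = agreement(ours, theirs)
    print(rate)
    for pair, count in diffs.most_common():
        print(count, pair)
```

The disagreement tally is arguably the more useful output: cases where the reference label is application/octet-stream but a generated signature matched are exactly the tail this exercise is after.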