Splitting a large binary file into sections determined by context patterns

files split

I have a large (2GB) file that looks like this:

^%%-=-=-=-=-=-=-=-=-=-=-=-=-=-%%^
<binary data>
^%%-=-=-=-=-=-=-=-=-=-=-=-=-=-%%^ 
<binary data>
^%%-=-=-=-=-=-=-=-=-=-=-=-=-=-%%^
<binary data>
...

The ^%%-=-=-=-=-=-=-=-=-=-=-=-=-=-%%^ lines are separators. The binary segments are large. There are about fifty of them in the file.

I am trying to extract the binary parts of this file. Each binary segment needs to go into its own file.

I tried using csplit,

csplit --digits=2 --prefix=out stu.ear '/\^%%-=-=-=-=-=-=-=-=-=-=-=-=-=-%%\^/'

but it printed the following output and produced only two out?? files:

1
2097951144

Is there a tool for this job, perhaps a csplit implementation that works with binary files?

Best Answer

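The following should work. It starts a new output file each time a separator line is seen, so every out<n>.ear file begins with its separator line. Running awk under LC_ALL=C keeps it from interpreting the binary data as multibyte characters:

      LC_ALL=C awk '/\^%%-=-=-=-=-=-=-=-=-=-=-=-=-=-%%\^/{n++}{print >"out" n ".ear"}' stu.ear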
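If a line-oriented tool is a concern for binary data, the same split can be sketched in Python by searching for the separator bytes directly. This is only a minimal sketch under stated assumptions: the function name split_binary, the out prefix, and the .bin suffix are made up for illustration; the separator is assumed never to occur inside a binary segment; and the newline that precedes each separator is kept as part of the segment.

```python
import mmap

# Separator text taken from the question; everything else here is illustrative.
SEPARATOR = b"^%%-=-=-=-=-=-=-=-=-=-=-=-=-=-%%^"


def split_binary(path, prefix="out", sep=SEPARATOR):
    """Write each segment that follows a separator line to its own file.

    Returns the list of file names written. Assumes the separator bytes
    never appear inside a binary segment.
    """
    names = []
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as data:
        pos = data.find(sep)
        while pos != -1:
            # Skip the rest of the separator line (it may have trailing spaces).
            line_end = data.find(b"\n", pos + len(sep))
            start = len(data) if line_end == -1 else line_end + 1
            # The segment runs up to the next separator or the end of the file.
            nxt = data.find(sep, start)
            end = len(data) if nxt == -1 else nxt
            if end > start:
                name = f"{prefix}{len(names):02d}.bin"
                with open(name, "wb") as out:
                    out.write(data[start:end])  # copies one segment into memory
                names.append(name)
            pos = nxt
    return names
```

Calling split_binary("stu.ear") would write out00.bin, out01.bin, and so on into the current directory. Unlike the awk command, the separator lines themselves are not copied into the output files.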

Related Solutions

Splitting large directory tree into specified-size chunks

There exists an application designed for this: dirsplit

It usually ships in the cdrkit or dirsplit package.

It can create ready-to-use folders of links, which makes it easy to create DVDs with K3b or other GUI software.

Linux – how to find offset of one binary file inside another

I could not find an existing tool for this.

grep -F --binary --byte-offset --only-matching seems to come close, but you can't escape newlines with -F. cmp only lets you skip bytes, and diff does not seem to be of much help either.

But it takes only a few lines in a programming language with a decent library. For example, as a C++ program using Boost:

#include <boost/algorithm/string/find.hpp>
#include <boost/iostreams/device/mapped_file.hpp>
#include <cassert>
#include <iostream>
using namespace boost;
using namespace boost::algorithm;
using namespace boost::iostreams;
using namespace std;

int main(int argc, char **argv)
{
  if (argc != 3) {
    cerr << "Call: " << argv[0] << " PATTERN_FILE SRC_FILE\n";
    return 3;
  }
  mapped_file_source pattern(argv[1]);
  mapped_file_source src(argv[2]);
  iterator_range<const char*> p_range(pattern.data(),
      pattern.data() + pattern.size());
  iterator_range<const char*> s_range(src.data(), src.data() + src.size());
  iterator_range<const char*> result = find_first(s_range, p_range);
  if (result) {
    size_t pos = result.begin()-s_range.begin();
    cout << pos << '\n';
    return 0;
  }
  return 1;
}

You can compile it like this (when the program source is saved as searchb.cc):

$ make CXXFLAGS="-Wall -g" LDLIBS="-lboost_iostreams" searchb

To test it:

$ dd if=WTF_-_EPISODE_277_RACHAEL_HARRIS.mp3 of=t skip=232323 bs=1 count=4K
$ ls -l t
-rw-r--r-- 1 juser users 4096 2012-05-31 15:24 t
$ ./searchb t WTF_-_EPISODE_277_RACHAEL_HARRIS.mp3
232323

The output is the matching position in the source file.

If the pattern file is not contained in the source file, the exit status is 1.

Update: In the meantime I've implemented this simple tool in several languages (C/C++/Python/Rust/Go) and have included those implementations in my utility repository. Look for searchb*. The Python implementation is the shortest one and doesn't require any external dependencies.
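For illustration, the same search can be sketched in Python using only the standard library. This is not the implementation from the repository mentioned above, just a minimal sketch of the idea; the function name find_offset is hypothetical. It assumes the pattern file is small enough to read into memory and that the source file is not empty (mmap refuses to map an empty file).

```python
import mmap


def find_offset(pattern_path, src_path):
    """Return the byte offset of the first occurrence of the pattern
    file's contents inside the source file, or -1 if it is not contained."""
    with open(pattern_path, "rb") as pf:
        pattern = pf.read()
    # Map the source file read-only so it is not loaded into memory at once.
    with open(src_path, "rb") as sf, \
            mmap.mmap(sf.fileno(), 0, access=mmap.ACCESS_READ) as src:
        return src.find(pattern)
```

A wrapper script could print the returned offset and exit with status 1 when the result is -1, mirroring the behavior of the C++ program.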
