How to export all OneNote pages to individual markdown files


I am moving to Linux and the last hurdle is to get out of OneNote. I'd like to export all of my notebooks so that every page goes to an individual markdown file.

I've tried many things—this thread had several suggestions, but they are all outdated.

If I could get OneNote to export all of the pages as individual .docx files, it would be easy to use pandoc to convert them to individual markdown files. But, OneNote will only export multiple pages as a single file. So, one route would be to find a way to automate the export of every single page individually.

Another option is to export entire notebooks at a time as .docx files, convert them to markdown with pandoc, and then split the files—but, I am not enough of a regex wizard to get csplit to cut the files correctly with just its basic regular expressions, and not enough of an awk wizard to get it to output files with the correct and full regular expression.

Can anyone help me with this?

Best Answer

I did end up finding an export pipeline, but it was a pain. Here are my notes from doing that:

workflow:

1. Turn off your network(s) to prevent OneNote from performing a lengthy OneDrive sync after each export.
2. In the Notebooks list, expand the notebook to see all the tabs.
3. Right click a tab and click "Export...".
4. Click the filetype dropdown and press M to select .docx format. Press Enter to select it.
5. Press Enter again to save the exported file.
6. Repeat steps 2-5 for each tab in the notebook.
7. Set up pandoc and open a PowerShell or cmd window.
8. cd into the directory where the exported .docx files are located.
9. For each exported .docx file, use the following pandoc command to convert it to markdown (replace journal with the name of your file):
```
pandoc --extract-media='' --wrap=preserve '.\journal.docx' -o journal.md
```
Here's an explanation of the command: --extract-media='' tells pandoc to extract images from the .docx file and put them in the default subfolder (named 'media'). --wrap=preserve tells pandoc not to hard-wrap the output file with line breaks (which it does by default). The next field is the input filename, and -o stands for 'output', so journal.md is the output filename.

If you don't want to split this file (for example, if your tab contained only one page), skip to step 15.

(When you are doing a bunch of these, you can press the ↑ (up arrow) key to recall the previous command in the shell, then edit the filename.)
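If you have many tabs, the conversion in step 9 and the folder shuffling in steps 10 and 11 can be looped. The following is only a sketch, not part of the original workflow: `convert_docx_dir` and the `PANDOC` override are hypothetical names, and it assumes that `--extract-media=.` writes images to a `media` folder in the current directory, as `--extract-media=''` does above.

```shell
#!/usr/bin/env bash
# Sketch: convert every .docx in a directory, giving each file its own folder.
# convert_docx_dir and the PANDOC override are hypothetical names, not pandoc features.
convert_docx_dir() (
    cd "${1:-.}" || exit 1
    for docx in *.docx; do
        [ -e "$docx" ] || continue          # no .docx files in this directory
        name=${docx%.docx}
        # Same options as the single-file command; images land in ./media
        "${PANDOC:-pandoc}" --extract-media=. --wrap=preserve "$docx" -o "$name.md" || continue
        mkdir -p "$name"                    # folder for this tab's pages (step 10)
        if [ -d media ]; then
            mv media "$name/"               # keep one media folder per tab (step 11)
        fi
    done
)
```

Calling `convert_docx_dir ~/onenote-export` (an example path) would leave journal.md next to a journal folder that already holds its media subfolder, ready for the splitting step.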
10. Create a new folder to store the pages in the tab. For this example, right now all the pages from our Journal tab in OneNote are mashed together in journal.md. Make a folder called journal which will store the final separated pages as individual .md files.
11. If there were any images in the .docx file, these will be exported to a new folder called media. Drag the media folder, if it exists, into the folder you just created. (This is why we need to do each pandoc operation separately: each export creates its own media folder, and we want to keep these separate so the links in the markdown files work correctly. We could write a clever script to do all this automatically, but it will take less time to just do it manually, unless you have a huge number of notebooks.) (Note: You can save a step by putting your desired folder name in the single quotes of the --extract-media='' argument. For .docx files with images, a folder will be created automatically for you.)
12. Open a bash terminal and cd to the directory containing the .md file. The folder you created in step 10 must be a subfolder of this one (unless you fix the path in the following command).
13. If you haven't already, click the Windows Bash window icon, click on Properties, check QuickEdit Mode, then click OK. Now click on the Windows Bash window icon again, this time click on Defaults, check QuickEdit Mode, and then click OK (so new Bash windows you create in the future will remember this setting). Now you can select text in the terminal and press Ctrl+C to copy, or right-click the terminal window to paste text from the clipboard. This lets us prepare our command in a separate location and quickly paste each version into Bash.
14. Customize the following command and run it for each .md file you want to split into individual pages:

csplit ./journal.md --keep-files --prefix='journal/journalentry ' --suffix-format='%i.md' --elide-empty-files '/^\(Monday\|Tuesday\|Wednesday\|Thursday\|Friday\|Saturday\|Sunday\),/-2' '{*}'

(Type it as one line.)

As you can see, journal.md is the name of our markdown file (in the current directory, denoted by ./), the second occurrence of journal (after --prefix=') is the name of our subfolder which will contain the split files, and journalentry is what each file will be named (followed by an index number).

If you want to understand the command, here's an explanation: --keep-files keeps the output files that have already been written when csplit hits an error or the end of the file, ensuring the last page is output correctly (since it doesn't end in the pattern of our regular expression). --prefix sets the naming scheme of the output files. --suffix-format lets us set the file extension (.md in this case), but it must include %i, the printf-style conversion that outputs the index number of the file. --elide-empty-files skips outputting empty files, which we don't care about. Finally, the pattern, which begins with '/ and ends with /-2', defines where to split the file: it says "find (/) at the start of a line (^) one of the following (\() Monday or (\|) Tuesday or Wednesday or Thursday or Friday or Saturday or Sunday (\)), followed by a comma, then step back two lines (-2)" and split the file there, outputting everything up to that point. The final bit, '{*}', repeats the previous pattern as many times as possible, until the end of the file is reached.
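To check the pattern before running it on a real export, the same split can be wrapped in a small function and tried on a throwaway file. This is a sketch that assumes GNU csplit. `split_pages` is a hypothetical helper name, and `--quiet` and the zero-padded `%03d` suffix are my additions rather than part of the command above.

```shell
#!/usr/bin/env bash
# Sketch, assuming GNU csplit. split_pages is a hypothetical helper name.
split_pages() {
    local md=$1 outdir=$2
    mkdir -p "$outdir"
    # --quiet suppresses the byte counts csplit normally prints for each piece.
    # %03d zero-pads the index so the pieces sort in page order.
    # The pattern starts a new piece two lines above every line that begins
    # with a weekday name followed by a comma (the OneNote date line).
    csplit --quiet --keep-files --elide-empty-files \
        --prefix="$outdir/page " --suffix-format='%03d.md' \
        "$md" \
        '/^\(Monday\|Tuesday\|Wednesday\|Thursday\|Friday\|Saturday\|Sunday\),/-2' '{*}'
}
```

For example, `split_pages journal.md journal` writes the pieces into the journal folder. If your pages have the date line somewhere other than two lines below the title, the -2 offset is the part to adjust.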
15. Drag the .docx and .md files into a folder, say a folder you create now called intermediates. Or you can just delete them. It's nice to save them for a while, until you are comfortable with your new file format, in case you want to go back and reference something that happened during the conversion process. Moving them into the intermediates folder now will reduce the chance of forgetting where we are and repeating steps.
16. Repeat steps 9-14 for each .docx file you exported from OneNote.
17. Now you have one folder for each tab, with a bunch of separate .md files in it, one for each page! Plus a media folder in each subfolder that had images in the OneNote tab.
18. I recommend exporting each of your OneNote notebooks as a .mht file (Single File Web Page), or, if you prefer, a .pdf. This way, if formatting or other information was lost in some of your markdown files because of the multiple conversions, you can always go back and easily see how it was supposed to look in the .mht file. In addition, I would recommend exporting each of your OneNote notebooks as a .onepkg file (OneNote Package), so you have a nice final export copy if you ever want to reopen the notebook in OneNote in its native/original file format (this might be useful if, for example, the .mht file is also missing some original formatting that you want to recover).
19. As you finish each notebook, right-click the notebook in OneNote and click "Close This Notebook" so you won't accidentally edit the notebook and have to re-export your new changes. For the markdown folders, I also created a folder for each notebook, and put all the tab folders in it.
20. When you are finished with the whole export project, you can go to your OneDrive and delete the original OneNote notebooks which have synced there (make sure you are backing up your own files now, of course! There is OneDrive for Linux, or you could try something like Syncthing).
21. Finally, we can rename all our .md files to their OneNote page title, which is the first line in each file, by using two scripts. Make the following files:

File 1: ~/scripts/rename-files-to-first-line.sh
```
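#!/usr/bin/env bash
# Rename each .md file in the current directory to its first line (the page title).
for i in *.md; do
    t=$(head -n 1 "$i" | tr -d '\000-\037[]{}()/\?*')   # drop control and filename-unsafe characters
    [ -n "$t" ] && mv -n -- "$i" "$t.md"                 # skip blank titles; -n never overwrites
done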
```
File 2: ~/scripts/recurse.sh
```
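#!/usr/bin/env bash
CDIR=$(pwd)
# Visit every directory below the current one (names with spaces included).
find . -type d -print0 | while IFS= read -r -d '' DIR; do
    cd "$DIR" || continue
    "$@"                          # Your command
    cd "$CDIR" || exit 1
done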
```
Make both scripts executable (chmod +x on each file), then navigate to your notes folder and run recurse.sh with rename-files-to-first-line.sh as its argument:
```
$ ~/scripts/recurse.sh ~/scripts/rename-files-to-first-line.sh
```
You will see the script go through all your files recursively, throwing some errors on files with weird first lines (that won't convert to a filename) and on other edge cases. However, the mv command in rename-files-to-first-line is executed with argument -n, which will prevent it from overwriting any files. There might be a few notes that don't get renamed, because the first line in them is blank or something else weird, but you can just fix those few files manually.
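Because the rename is hard to undo, it can help to preview it first. The sketch below prints what would happen in the current directory without touching any file. `page_title` and `preview_renames` are hypothetical helper names, and the two tr calls are meant to delete the same characters as the rename script above.

```shell
#!/usr/bin/env bash
# Sketch: dry-run of the rename step. Both function names are hypothetical.

# Print the first line of a file with control characters and
# filename-unsafe characters removed.
page_title() {
    head -n 1 "$1" | tr -d '\000-\037' | tr -d '[]{}()/?*'
}

# List the planned renames for every .md file in the current directory.
preview_renames() {
    local f title
    for f in *.md; do
        [ -e "$f" ] || continue                 # no .md files here
        title=$(page_title "$f")
        if [ -z "$title" ]; then
            echo "SKIP (blank first line): $f"
        elif [ -e "$title.md" ]; then
            echo "SKIP (target exists): $f"
        else
            echo "$f -> $title.md"
        fi
    done
}
```

After sourcing the file, run `preview_renames` inside one tab folder. The SKIP lines are the files you would have to rename by hand afterwards.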
22. Bask in your clean escape from OneNote.

Caveats:

This doesn't capture subpages—you will have to recreate those with sub-subfolders, if you like.
I don't know how well it does with tables—markdown is a bit ungainly for tables anyway.
There are probably other kinds of formatting, such as fonts, which get lost or screwed up in the export. But for rich text and images, it works pretty well!

Related Solutions

Correctly sizing PNG images in markdown with pandoc for html/pdf/docx

Examples: dpi, width, height.

If you want to give pandoc the dpi information, add the --dpi option to override the default (96 dpi).

If most of your pictures have a common height or width, that should be easily corrected.

For example, change the image line to one of these:

![my caption](./figures/myimage.png){ width=250px }

![my caption](./figures/myimage.png){ height=256px }

Or do this in straight HTML markup:

<img src="./figures/myimage.png" alt="my caption" style="width: 250px;"/>

<img src="./figures/myimage.png" alt="my caption" style="height: 256px;"/>

and the ratio will be correct.
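If many images in a file need the same width, the attribute can be appended with sed instead of by hand. This is a rough sketch: `add_image_width` is a hypothetical helper, it only handles inline images written as `![caption](path)` on one line, and it assumes the width value contains no characters that are special to sed (such as / or &).

```shell
#!/usr/bin/env bash
# Sketch: append { width=... } to markdown images that have no attribute block yet.
# add_image_width is a hypothetical helper; it reads stdin and writes stdout.
add_image_width() {
    # \1 is the whole ![caption](path) image; \2 is the character after it
    # (anything but "{"), or nothing when the image ends the line.
    sed -E 's/(!\[[^]]*\]\([^)]*\))([^{]|$)/\1{ width='"$1"' }\2/g'
}
```

Usage would look like `add_image_width 250px < page.md > page-sized.md`. Images that already carry an attribute block such as { height=256px } are left alone.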

Reference: Pandoc Readme

For HTML and EPUB, all attributes except width and height (but including srcset and sizes) are passed through as is. The other writers ignore attributes that are not supported by their output format.

The width and height attributes on images are treated specially. When used without a unit, the unit is assumed to be pixels. However, any of the following unit identifiers can be used: px, cm, mm, in, inch and %.

Dimensions are converted to inches for output in page-based formats like LaTeX. Dimensions are converted to pixels for output in HTML-like formats. Use the --dpi option to specify the number of pixels per inch. The default is 96dpi.

The % unit is generally relative to some available space. For example, ![](file.jpg){ width=50% } will render to <img href="file.jpg" style="width: 50%;" /> (HTML), \includegraphics[width=0.5\textwidth]{file.jpg} (LaTeX), or \externalfigure[file.jpg][width=0.5\textwidth] (ConTeXt).

Some output formats have a notion of a class (ConTeXt) or a unique identifier (LaTeX \caption), or both (HTML).

When no width or height attributes are specified, the fallback is to look at the image resolution and the dpi metadata embedded in the image file.

Compile PDF book from multiple Markdown files

I assume your recipeX.md files already have headlines by themselves? And they are all at level 2?

Then a small series of (Linux or macOS) shell commands (or a little script) like the following should work (use the most recent version of Pandoc!):

cd Recipes
for i in */ ; do
   echo "# ${i%/}"
   echo
   for j in "$i"*.md ; do
      cat "$j"
      echo
   done
done |
pandoc                        \
  --toc                       \
  --number-sections           \
  --top-level-division=part   \
  --output=my-recipe-book.pdf \
  -

Warning: do not miss the last '-' in the code above!
If your recipeX.md files do NOT have headlines, try this variation:

cd Recipes
for i in */ ; do
   echo "# ${i%/}"
   echo
   for j in "$i"*.md ; do
      echo "## $(basename "$j" .md)"
      echo
      cat "$j"
      echo
   done
done |
pandoc                        \
  --toc                       \
  --number-sections           \
  --top-level-division=part   \
  --output=my-recipe-book.pdf \
  -
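To inspect the combined markdown before handing it to pandoc, the concatenation can be pulled into a function of its own. This is a sketch of the second variation (files without headlines). `build_book_markdown` is a hypothetical name, and it assumes one level of subfolders under the directory you pass in.

```shell
#!/usr/bin/env bash
# Sketch: print one markdown stream with a level-1 heading per folder
# and a level-2 heading per file. build_book_markdown is a hypothetical name.
build_book_markdown() (
    cd "$1" || exit 1
    for dir in */; do
        [ -d "$dir" ] || continue               # no subfolders at all
        printf '# %s\n\n' "${dir%/}"
        for md in "$dir"*.md; do
            [ -e "$md" ] || continue            # folder without .md files
            printf '## %s\n\n' "$(basename "$md" .md)"
            cat "$md"
            printf '\n'                         # blank line between files
        done
    done
)
```

You could then run `build_book_markdown Recipes | less` to read the result, or pipe it into the pandoc command above (keeping the final '-' so pandoc reads from standard input).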
