Sunday, 10 July 2016

Renaming arXiv PDFs

When you download papers from arxiv.org, each file is named after the publication's unique ID, not the paper's title.  This makes the files hard to browse, especially if you like to download several papers to read later.

This quick bit of bash script looks through the current directory for any .pdf file whose name looks like an arXiv ID (digits and dots only), extracts the text with pdftotext and looks for a title row.  The text format isn't uniform across papers, so it needs to skip any very short rows (fewer than 5 characters) and any metadata rows (those containing "arXiv").  Once an appropriate title is found, it's standardised to form a filename: every non-alphanumeric character becomes an underscore, and the result is cut to 80 characters.

Since it may need to read through several rows, I've stored the contents in a file, but it might be tidier to do this in memory.
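for file in *.pdf; do
    # Only handle names made of digits and dots, i.e. ones that look like an arXiv ID
    [[ $file =~ ^[0-9.]+\.pdf$ ]] || continue
    pdftotext "$file" temp.txt || continue
    rows=$(wc -l < temp.txt)
    rowNum=1
    title=
    # Keep reading rows while the candidate is too short or is an arXiv metadata row,
    # stopping at the end of the text so the loop can't run forever
    while (( rowNum <= rows )) && { [ ${#title} -lt 5 ] || [[ $title == *arXiv* ]]; }; do
      title=$(sed "${rowNum}q;d" temp.txt | sed 's/[^A-Za-z0-9]/_/g')
      title=${title:0:80}
      rowNum=$((rowNum+1))
    done
    # Rename only if a usable title was actually found
    if [ ${#title} -ge 5 ] && [[ $title != *arXiv* ]]; then
      mv -- "$file" "$title.pdf"
    fi
    rm temp.txt
done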
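
As for doing this in memory rather than through a file, here is a rough sketch of how it might look.  It assumes pdftotext is asked to write to stdout (by passing "-" as the output name) and pipes that straight into a row-picking function.  The names pick_title and rename_arxiv_pdfs are made up for this example, and I haven't tried it on a wide range of papers, so treat it as a starting point rather than a finished tool.

```shell
#!/usr/bin/env bash

# pick_title (hypothetical helper): read text on stdin and print the first row
# that, once sanitised, is at least 5 characters long and does not mention "arXiv".
pick_title() {
    local row title
    # The "|| [ -n ... ]" part also handles a final row with no trailing newline
    while IFS= read -r row || [ -n "$row" ]; do
        # Replace every non-alphanumeric character with an underscore
        title=${row//[^A-Za-z0-9]/_}
        # Cut the result to at most 80 characters
        title=${title:0:80}
        if [ ${#title} -ge 5 ] && [[ $title != *arXiv* ]]; then
            printf '%s\n' "$title"
            return 0
        fi
    done
    return 1   # no suitable row found
}

# rename_arxiv_pdfs (hypothetical wrapper): rename each PDF in the current
# directory whose name consists only of digits and dots.
rename_arxiv_pdfs() {
    local file title
    for file in *.pdf; do
        [[ $file =~ ^[0-9.]+\.pdf$ ]] || continue
        # "-" makes pdftotext write to stdout, so no temporary file is needed
        if title=$(pdftotext "$file" - | pick_title); then
            mv -- "$file" "$title.pdf"
        fi
    done
}
```

To try it, source the block and call rename_arxiv_pdfs from the directory holding the downloads.  Files for which no suitable row turns up are simply left under their original names.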