This quick bit of bash script looks through the current directory for any .pdf file whose name looks like an arXiv ID, extracts the text with pdftotext and looks for a title row. The text format isn't uniform across papers, so it needs to skip any very short rows (fewer than five characters) and any metadata rows (those that mention "arXiv"). Once an appropriate title is found, it's standardised to form a filename: every non-alphanumeric character becomes an underscore and the result is cut to 80 characters.
Since it may need to read through several rows, I've stored the contents in a file, but it might be tidier to do this in memory.
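for file in *.pdf; do
    # Only handle names that look like arXiv IDs, e.g. 1234.5678.pdf
    [[ $file =~ ^[0-9.]+\.pdf$ ]] || continue
    pdftotext "$file" temp.txt || continue
    rowNum=1
    rowCount=$(($(wc -l < temp.txt)))
    title=
    # Skip rows that are too short or that mention arXiv (metadata)
    while [ "${#title}" -lt 5 ] || [[ $title == *arXiv* ]]; do
        # Stop at the end of the text rather than looping forever
        [ "$rowNum" -gt "$rowCount" ] && { title=; break; }
        title=$(sed "${rowNum}q;d" temp.txt | sed 's/[^A-Za-z0-9]/_/g')
        title=${title:0:80}
        rowNum=$((rowNum+1))
    done
    # Rename only if a usable title was found
    [ -n "$title" ] && mv -- "$file" "$title.pdf"
    rm -f temp.txt
done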
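For the in-memory variant mentioned above, here is a minimal sketch rather than a tested replacement. It assumes bash and relies on pdftotext writing to standard output when the output name is "-". The function names pick_title and rename_arxiv_pdfs are made up for illustration, and the -n flag on mv (don't overwrite an existing file) is an extra precaution that the original script doesn't take.

```shell
#!/usr/bin/env bash

# pick_title (hypothetical name): read text on stdin and print the first
# row that looks like a title, already standardised for use as a filename.
pick_title() {
    local line
    while IFS= read -r line || [ -n "$line" ]; do
        line=${line//[^A-Za-z0-9]/_}              # non-alphanumerics become underscores
        [ "${#line}" -lt 5 ] && continue          # too short to be a title
        case $line in *arXiv*) continue ;; esac   # metadata row
        printf '%s\n' "${line:0:80}"              # cut to 80 characters
        return 0
    done
    return 1                                      # no suitable row found
}

# rename_arxiv_pdfs (hypothetical name): apply pick_title to every PDF in
# the current directory whose name consists of digits and dots.
rename_arxiv_pdfs() {
    local file title
    for file in *.pdf; do
        [[ $file =~ ^[0-9.]+\.pdf$ ]] || continue
        # "-" sends the extracted text to stdout, so no temp file is needed
        title=$(pdftotext "$file" - | pick_title) || continue
        mv -n -- "$file" "$title.pdf"             # assumption: never overwrite
    done
}
```

Running rename_arxiv_pdfs in a directory of downloaded papers would then do the renaming. Keeping the title logic in its own function also means it can be tried on plain text without any PDFs to hand.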