How to sort and rename files in Python on Linux and use tesseract OCR to extract text to a file

To convert a series of images that contain text into text files, Tesseract is a good choice. It is an old but very reliable OCR engine for finding and extracting text in images.

After saving some screenshots as jpg files, it turned out that their names contained the date and time of saving. This information could be used to put the files in sequence: sort the names, then rename the files with a running number. The Python script below does just that:

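import os

def main():
	rename_files()

def rename_files():
	folder = "."
	# Only rename the images; os.listdir() also returns this script and any other files.
	files = sorted(f for f in os.listdir(folder) if f.lower().endswith(".jpg"))
	for count, name in enumerate(files, start=1):
		# Zero-pad the number so that img010.jpg sorts after img002.jpg.
		src = os.path.join(folder, name)
		dst = os.path.join(folder, f"img{count:03d}.jpg")

		print(src, dst)

		os.rename(src, dst)

if __name__ == '__main__':
	# Call main() function
	main()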
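
Since a rename is not easy to undo, it can help to preview the mapping before touching any file. The following is only a minimal sketch of that idea: plan_renames, its prefix and ext parameters, and the sample file names are all hypothetical and not part of the script above. It assumes the timestamps in the names sort correctly as plain strings.

```python
def plan_renames(names, prefix="img", ext=".jpg"):
    """Return (old_name, new_name) pairs without renaming anything."""
    # Keep only the images, sorted so the timestamped names fall in order.
    images = sorted(n for n in names if n.lower().endswith(ext))
    # Pad to the width of the largest number (e.g. 01..12 for twelve files).
    width = len(str(len(images)))
    return [(old, f"{prefix}{i:0{width}d}{ext}")
            for i, old in enumerate(images, start=1)]


# Preview with made-up file names; nothing on disk is touched.
sample = ["shot_2021-03-02.jpg", "notes.txt", "shot_2021-03-01.jpg"]
for old, new in plan_renames(sample):
    print(old, "->", new)
```

Passing os.listdir(".") instead of the sample list would show what the real rename is going to do.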

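The next step is to run tesseract on those image files and collect any text in them into a text file. To do so, open a terminal window in the folder with the images and enter the command given below. It extracts the text from every jpg file and appends it to a single file, outtext. Note that the shell expands *.jpg in alphabetical order, so with ten or more files img10.jpg is processed before img2.jpg unless the numbers are zero-padded.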

$ for i in *.jpg; do tesseract "$i" stdout >> outtext; done

Now outtext can be opened in any text editor and refined.
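
The same loop can also be written in Python, which makes it easy to control the order in which the images are read. This is a rough sketch under a few assumptions: the tesseract command is installed and on the PATH, and natural_key, run_tesseract and collect_text are hypothetical helper names chosen for this example. It sorts names so that digit runs compare as numbers, then calls the same "tesseract image stdout" command for each file.

```python
import re
import subprocess
from pathlib import Path


def natural_key(name):
    """Sort key that compares digit runs as numbers, so img2 sorts before img10."""
    return [int(part) if part.isdecimal() else part.lower()
            for part in re.split(r"(\d+)", name)]


def run_tesseract(path):
    """Run the tesseract command on one image and return the recognized text."""
    result = subprocess.run(
        ["tesseract", str(path), "stdout"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def collect_text(paths, ocr=run_tesseract):
    """Concatenate the OCR output of every path, in natural order."""
    ordered = sorted(paths, key=lambda p: natural_key(Path(p).name))
    return "".join(ocr(p) for p in ordered)


# Example use (commented out so nothing runs on import):
# text = collect_text(Path(".").glob("*.jpg"))
# Path("outtext").write_text(text)
```

The ocr parameter is there only so the ordering logic can be tried without tesseract installed, by passing in any function that takes a path and returns a string.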

Published by Rajan R. Vaswani