I have a lot of documents lying around. The dream is to digitise them all and get rid of the paper piles. The sheer amount, though, makes this very intimidating.
This project is the first step towards the long journey of organising documents.
I am not pretending that the bicycle does not exist here. There are many scanner apps available. The problem is that they always come short in at least one direction: some are slow, some inaccurate, some cannot handle many documents at once, some compress the files too much, etc.
The goal is to take photos with a phone (or anything, really, as long as we are not left at the mercy of a copier), detect the document, and crop the photo to content.
Input | Output |
---|---|
![]() |
![]() |
I have deliberately selected a setting where the object is not completely isolated. You can see parts of another document, my keyboard, some cables, etc. The important criteria for this algorithm is that the document is the largest white (light) shape on a dark background.
The first step is to remove some of the details. This can be done by morphological transformations: opening and closing (read more here). To perform these operations, we need to define kernels. The bigger the kernel, the bigger details it removes, but we need to be careful, since there are some features we want to keep (like the overall shape of the document). This is why I define three different sizes:
kernel = np.ones((7,7),np.uint8)
kernel_bigger = np.ones((50,50),np.uint8)
kernel_even_bigger = np.ones((5,900),np.uint8)
We first convert our image to grayscale, run the threshold filter (this maps every point to either black or white, depending on its current value), and perform consecutive morphological transformations:
Small kernel | Large kernel |
---|---|
![]() |
![]() |
After this, we can look for contours and choose the biggest bounding box.
Largest bounding box (green) | Cropped image |
---|---|
![]() |
![]() |
If you want, you can call this a day, but there is one more step we can do: perspective warping. For that, we need corners. Before we continue, though, we have one large detail to take care of (the logo on top). For that our medium-sized kernel was not enough, however, the bigger kernels will distort the overall shape of the document. Time to do the trick mentioned above: add some dark padding before the convolution and remove it afterward.
Extra dark padding | Large kernel convolution |
---|---|
![]() |
![]() |
After this, we detect the edges using Canny's algorithm and turn them to lines using Hough transform.
Canny edge detection | Hough lines |
---|---|
![]() |
![]() |
Since the edges of the document are not perfectly straight, we end up with multiple Hough lines. We can find all intersections (through away the ones if the intersecting angle is not close to 90 degrees), cluster them using k-means and define corners as the geometric centres of these clusters.
After we have the corners, we map them to corners of an A4 paper in 300 DPI.
Corners from clusters | Perspective cropping |
---|---|
![]() |
![]() |
The output image is ready for OCR (Optical Character Recognition) and some color-correction (such as brightness and contrast) in order to make it more readable.