gscan2pdf - A GUI to produce PDFs or DjVus from scanned documents
Screenshot: Main page v2.4.0
None
gscan2pdf has the following command-line options:
Specifies the device to use, instead of getting the list of devices via the SANE API. This can be useful if the scanner is on a remote computer which is not broadcasting its existence.
Displays this help page and exits.
Specifies a file to store logging messages.
Defines the log level. If a log file is specified, this defaults to --debug, otherwise --error.
Imports the specified file(s). If the document has more than one page, a window is displayed to select the required pages.
Displays the program version and exits.
Scanning is handled with SANE via scanimage. PDF conversion is done by PDF::Builder. TIFF export is handled by libtiff (faster and smaller memory footprint for multipage files).
To diagnose a possible error, start gscan2pdf from the command line with logging enabled:
gscan2pdf --log=file.log
and check file.log.
None
gscan2pdf creates a text resource file in ~/.config/gscan2pdfrc. The directory can be changed by setting the $XDG_CONFIG_HOME variable. Generally, however, preferences should be changed via the Edit/Preferences menu, or are captured automatically during normal usage of the program.
None known.
Whilst it is possible to import PDFs, this is intended for round-tripping files created by gscan2pdf.
gscan2pdf is available on Sourceforge (https://sourceforge.net/projects/gscan2pdf/files/gscan2pdf/).
If you are using Debian, you should find that sid has the latest version already packaged.
If you are using an Ubuntu-based system, you can automatically keep up to date with the latest version via the PPA:
sudo apt-add-repository ppa:jeffreyratcliffe/ppa
If you are using Synaptic, then use menu Edit/Reload Package Information, search for gscan2pdf in the package list, and lo and behold, you can install the nice shiny new version.
From the command line:
sudo apt update
sudo apt install gscan2pdf
The source is hosted in the files section of the gscan2pdf project on Sourceforge (https://sourceforge.net/projects/gscan2pdf/files/).
gscan2pdf uses Git for its Revision Control System. You can browse the tree at https://sourceforge.net/p/gscan2pdf/code/.
Git users can clone the complete tree with git clone git://git.code.sf.net/p/gscan2pdf/code
Having downloaded the source either from a Sourceforge file release, or from the Git repository, unpack it if necessary with
tar xvfz gscan2pdf-x.x.x.tar.gz
cd gscan2pdf-x.x.x
Then
perl Makefile.PL
will create the Makefile.
make test
should run several hundred tests to confirm that things will work properly on your system.
You can install directly from the source with
make install
but building the appropriate package for your distribution should be as straightforward as
make debdist
or
make rpmdist
However, you will additionally need the rpm, devscripts, fakeroot, debhelper and gettext packages.
The list below looks daunting, but all packages are available from any reasonably up-to-date distribution. If you are using Synaptic, having installed gscan2pdf, locate the gscan2pdf entry in Synaptic, right-click it and you can install them under Recommends. Note also that the library names given below are the Debian/Ubuntu ones. Those distributions using RPM typically use perl(module) where Debian has libmodule-perl.
There is a bug in versions of libgtk3-perl before 0.028 that causes gscan2pdf to crash when saving. Whilst I could prevent gscan2pdf from crashing, it would still be impossible to save anything, rendering gscan2pdf rather useless.
A simple interface to Gtk3's complex MVC list widget
Using libc functions for internationalisation in Perl
provides the functions for creating PDF documents in Perl
API library for scanners
Perl bindings for libsane.
manages sets of integers
TIFF manipulation and conversion tools
Image manipulation programs
A perl interface to the libMagick graphics routines
API library for scanners -- utilities.
scanner graphical frontends. Only required for the scanadf frontend.
post-processing tool for scanned pages. See https://www.flameeyes.eu/projects/unpaper.
Desktop integration utilities from freedesktop.org. Required for Email as PDF. See https://www.freedesktop.org/wiki/Software/xdg-utils/
Utilities for the DjVu image format. See http://djvu.sourceforge.net/
A command line OCR. See http://jocr.sourceforge.net/.
A command line OCR. See https://github.com/tesseract-ocr/tesseract
A command line OCR. See http://launchpad.net/cuneiform-linux
There are two mailing lists for gscan2pdf:
A low-traffic list for announcements, mostly of new releases. You can subscribe at https://lists.sourceforge.net/lists/listinfo/gscan2pdf-announce
General support, questions, etc. You can subscribe at https://lists.sourceforge.net/lists/listinfo/gscan2pdf-help
Before reporting bugs, please read the "FAQs" section.
Please report any bugs found, preferably against the Debian package. You do not need to be a Debian user, or set up an account to do this. The Debian tool "reportbug" provides a convenient GUI for doing so.
Alternatively, there is a bug tracker for the gscan2pdf project on Sourceforge (https://sourceforge.net/p/gscan2pdf/_list/tickets?source=navbar).
Please include the log file created by gscan2pdf --log=log
with any new bug report.
gscan2pdf has already been partly translated into several languages. If you would like to contribute to an existing or new translation, please check out Rosetta: https://translations.launchpad.net/gscan2pdf
Note that the translations for the scanner options are taken directly from sane-backends. If you would like to contribute to these, contact the sane-devel mailing list (sane-devel@lists.alioth.debian.org) and have a look at the po/ directory in the source code (http://www.sane-project.org/cvs.html).
Alternatively, Ubuntu has its own translation project. For the 9.04 release, the translations are available at https://translations.launchpad.net/ubuntu/jaunty/+source/sane-backends/+pots/sane-backends
If you have updated a .po file in the po directory of the gscan2pdf source tree and would like to test it, pick a test directory for the compiled locales, e.g. ./locale, and create the .mo files with:
perl Makefile.PL LOCALEDIR=./locale
If the updated locale is your standard one, then the following will find the updated file:
perl -I lib bin/gscan2pdf --log=log --locale=locale
If it is not your standard locale, you will need something like (for Russian):
LC_ALL=ru_RU.utf8 LC_MESSAGES=ru_RU.utf8 LC_CTYPE=ru_RU.utf8 LANG=ru_RU.utf8 LANGUAGE=ru_RU.utf8 perl -I lib bin/gscan2pdf --log=log --locale=locale
or German:
LC_ALL=de_DE LC_MESSAGES=de_DE LC_CTYPE=de_DE LANG=de_DE LANGUAGE=de_DE perl -I lib bin/gscan2pdf --log=log --locale=locale
If the above doesn't work, make sure the locale is in the list produced by locale -a, including any .utf8 suffix. If necessary, generate new locales with
sudo dpkg-reconfigure locales
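The check against the output of locale -a can be scripted. The following is a minimal sketch, not part of gscan2pdf: the helper name locale_listed is hypothetical, and ru_RU.utf8 is simply the example locale used above.

```shell
# Succeeds if the locale named in $1 appears (as an exact line) on stdin.
# locale_listed is a hypothetical helper, not something gscan2pdf provides.
locale_listed() {
    grep -qxF -- "$1"
}

wanted="ru_RU.utf8"   # example locale; the .utf8 suffix must match exactly

if locale -a 2>/dev/null | locale_listed "$wanted"; then
    echo "$wanted is available"
else
    echo "$wanted is missing; consider: sudo dpkg-reconfigure locales"
fi
```

The match is exact on purpose, so that ru_RU and ru_RU.utf8 are treated as different entries.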
Clears the page list.
Opens any format that imagemagick supports. PDFs will have their embedded images extracted and imported one per page.
Note that files can also be imported by dragging them into the thumbnail list from a program like nautilus or konqueror.
Sets options before scanning via SANE.
Chooses between available scanners.
Selects the number of pages, or all pages to scan.
Selects between single-sided or double-sided pages.
This affects the page numbering. Single-sided scans are numbered consecutively. Double-sided scans are incremented (or decremented, see below) by 2, i.e. 1, 3, 5, etc.
If double sided is selected above, assuming a non-duplex scanner, i.e. a scanner that cannot automatically scan both sides of a page, this determines whether the page number is incremented or decremented by 2.
To scan both sides of three pages, i.e. 6 sides:
# Pages = 3 (or "all" if your scanner can detect when it is out of paper)
Double sided
Facing side
# Pages = 3 (or "all" if your scanner can detect when it is out of paper)
Double sided
Reverse side
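The page numbers produced by the two passes above can be sketched with seq. This is only an illustration of the arithmetic: the function names are hypothetical, and it assumes that the reverse pass starts at page 2n and counts down, as it would if the pile were simply turned over.

```shell
# Page numbers for the facing-side pass of n sheets: 1, 3, 5, ...
facing_pages() {
    seq 1 2 $((2 * $1 - 1))
}

# Page numbers for the reverse-side pass, assuming the pile was flipped
# so that the back of the last sheet is scanned first: 2n, ..., 4, 2
reverse_pages() {
    seq $((2 * $1)) -2 2
}

echo "facing:" $(facing_pages 3)    # facing: 1 3 5
echo "reverse:" $(reverse_pages 3)  # reverse: 6 4 2
```

Together the two passes cover pages 1 to 6 exactly once.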
These, naturally, depend on your scanner. They can include
Guarantees that a "no documents" condition will be returned after the last scanned page, to prevent endless flatbed scans after a batch scan.
After sending the scan command, wait until the button on the scanner is pressed before actually starting the scan process.
Selects the document source. Possible options can include Flatbed or ADF. On some scanners, this is the only way of generating an out-of-documents signal.
Saves the selected or all pages as a PDF, DjVu, TIFF, PNG, JPEG, PNM or GIF.
Metadata are information that is not visible when viewing the PDF/DjVu, but is embedded in the file and is therefore searchable. It can be examined, typically with the "Properties" option of the document viewer.
The metadata are completely optional, but can also be used to generate the filename; see preferences for details.
The date can be selected with use of the calendar widget. The displayed date can be incremented or decremented with use of the '+' and '-' keys.
With DjVu, both black and white and colour images compress better than with PDF. See http://www.djvuzone.org/ for more details.
Attaches the selected or all pages as a PDF to a blank email. This requires xdg-email, which is in the xdg-utils package. If this is not present, the option is ghosted out.
Prints the selected or all pages.
If your temporary ($TMPDIR) directory is getting full, this function can be useful: it compresses all images as LZW-compressed TIFFs. These require much less space than the PNM files that are typically produced by SANE or by importing a PDF.
Deletes the selected page.
Renumbers the pages from 1..n.
Note that the page order can also be changed by drag and drop in the thumbnail view.
The select menus can be used to select all, even, odd, blank, dark or modified pages. Selecting blank or dark pages runs imagemagick to make the decision. Selecting modified pages selects those which have been modified by threshold, unsharp, etc., since the last OCR run was made.
When an image is scanned, gscan2pdf attempts to extract the resolution from the scan options. This nearly always works without problem.
Importing an image can be trickier, however. Some image formats such as PNM do not encode metadata for resolution. In other cases, the data is incorrect. Edit/Properties allows the user to manually correct the metadata for a particular page, thus correcting the size of the final PDF or DjVu. The image itself is otherwise not changed: it is neither down- nor upscaled.
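To see why the resolution metadata matters, note that the physical size of a page along one dimension is its pixel count divided by its resolution. The sketch below is plain arithmetic, not gscan2pdf code; the function name size_in_inches and the pixel counts are illustrative assumptions.

```shell
# Physical size in inches along one dimension = pixels / resolution (ppi).
# size_in_inches is a hypothetical helper used only for this illustration.
size_in_inches() {
    awk -v px="$1" -v ppi="$2" 'BEGIN { printf "%.2f\n", px / ppi }'
}

size_in_inches 2480 300   # 8.27  (roughly the width of an A4 page)
size_in_inches 2480 72    # 34.44 (same pixels, wrong resolution metadata)
```

The pixel data is identical in both cases; only the recorded resolution, and hence the page size in the final document, differs.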
The preferences menu item allows the control of the default behaviour of various functions. Most of these are self-explanatory.
gscan2pdf initially supported two frontends, scanimage and scanadf. scanadf support was added when it was realised that scanadf works better than scanimage with some scanners. On Debian-based systems, scanadf is in the sane package, not, like scanimage, in sane-utils. If scanadf is not present, the option is obviously ghosted out.
In 0.9.27, Perl bindings for SANE were introduced. These are called libsane-perl.
Before 1.2.0, options available through CLI frontends like scanimage were made visible as users asked for them. In 1.2.0, all options can be shown or hidden via Edit/Preferences, along with the ability to specify which options trigger a reload.
In 1.8.3, new Perl bindings for SANE were introduced. These are called libimage-sane-perl and are the preferred frontend.
In 1.8.5, support for libsane-perl was removed.
Ignore listed devices.
Note that this is a device name regular expression, e.g. /dev/video, and not the name as listed in the scan window, e.g. Noname Integrated_Webcam_HD.
All strftime codes (e.g. %Y for the current year) are available as variables, with the following additions:
author
filename extension
title
All document date codes use strftime codes with a leading D, e.g.:
document year
document month
document day
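For readers unfamiliar with strftime codes, the standard date utility expands the same codes and can be used to preview them. This sketch only covers the plain strftime codes for the current date; the gscan2pdf-specific additions listed above are not understood by date, and the function name expand_strftime is hypothetical.

```shell
# Expand a template of strftime codes for the current date using date(1).
# expand_strftime is a hypothetical helper, not part of gscan2pdf.
expand_strftime() {
    date "+$1"
}

expand_strftime '%Y'          # current year, four digits
expand_strftime '%Y-%m-%d'    # current date as year-month-day
```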
Zooms to 1:1. How this appears depends on the desktop resolution.
Scales the view so that the whole page is visible.
The rotate options require the package imagemagick and, if this is not present, are ghosted out.
Changes all pixels darker than the given value to black; all others become white.
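The rule can be illustrated on a list of grey values. This is a sketch of the thresholding rule only, not of how gscan2pdf applies it to an image; it assumes grey values on a 0-255 scale, one per line, and the function name threshold_pixels is hypothetical.

```shell
# Map each grey value (one per line, assumed to be 0-255) to black (0)
# if it is darker than the threshold in $1, and to white (255) otherwise.
threshold_pixels() {
    awk -v t="$1" '{ if ($1 + 0 < t + 0) print 0; else print 255 }'
}

# Prints 0, 0, 255, 255 on separate lines: a value equal to the
# threshold is not darker than it, so it becomes white.
printf '%s\n' 10 127 128 200 | threshold_pixels 128
```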
The unsharp option sharpens an image. The image is convolved with a Gaussian operator of the given radius and standard deviation (sigma). For reasonable results, radius should be larger than sigma. Use a radius of 0 to have the method select a suitable radius.
unpaper (see https://www.flameeyes.eu/projects/unpaper) is a utility for cleaning up a scan.
The gocr, tesseract or cuneiform utilities are used to produce text from an image.
There is an OCR output buffer for each page, which is embedded as plain text behind the scanned image in the PDF produced. This way, Beagle can index (i.e. search) the plain text.
In DjVu files, the OCR output buffer is embedded in the hidden text layer. Thus these can also be indexed by Beagle.
There is an interesting review of OCR software at https://web.archive.org/web/20080529012847/http://groundstate.ca/ocr. An important conclusion was that 400ppi is necessary for decent results.
Up to v2.04, the only way to tell which languages were available to tesseract was to look for the language files. Therefore, gscan2pdf checks the path returned by:
tesseract '' '' -l ''
If there are no language files in the above location, then gscan2pdf assumes that tesseract v1.0 is installed, which had no language files.
The following variables are available:
input filename
output filename
resolution
An image can be modified in-place by just specifying %i.
Possibly because SANE or your scanner doesn't support it.
If an option listed in the output of scanimage --help
that you would like to use isn't available, send me the output and I will look at implementing it.
In Edit/Preferences, tick the box "Allow batch scanning from flatbed".
Some Brother scanners report "out of documents", despite scanning from flatbed. This can be worked around by ticking the box "Force new scan job between pages".
If you are lucky, you have an option like Wait-for-button or Button-wait, where the scanner will wait for you to press the scan button on the device before it starts the scan, allowing you to scan multiple pages without touching the computer.
If you are quick, you might be able to change the document on the flatbed whilst the scan head is returning.
Otherwise, you have to set the number of pages to scan to 1 and hit the scan button on the scan window for each page.
Probably because the package required for that option is not installed. Email as PDF requires xdg-email (xdg-utils), unpaper and the rotate options require imagemagick.
Generally for HP scanners with an ADF, to scan from the flatbed, you should set "# Pages" to "1", and possibly "Batch scan" to "No".
As far as I can tell, this is pulled from changelogs.ubuntu.com, and therefore only the changelogs from official Ubuntu builds are displayed.
If your scanner is not connected directly to the machine on which you are running gscan2pdf and you have not installed the SANE daemon, saned, gscan2pdf cannot automatically find it. In this case, you can specify the scanner device on the command line:
gscan2pdf --device <device>
pdftotext or djvutxt can extract the text layer from PDF or DJVU files. See the respective man pages for details.
Having opened a PDF or DJVU file in evince or Acrobat Reader, the search function will typically find the page with the requested text and highlight it.
There are various tools for searching or indexing files, including PDF and DJVU:
(meta) Tracker (https://projects.gnome.org/tracker/)
plone (http://plone.org/)
pdfgrep (http://pdfgrep.sourceforge.net/)
swish-e (http://www.swish-e.org/)
recoll (http://www.lesbonscomptes.com/recoll/)
Create a file called ~/.config/gtk-3.0/gtk.css
with the following content:
.rubberband,
rubberband,
flowbox rubberband,
treeview.view rubberband,
.content-view rubberband,
.content-view .rubberband {
border: 1px solid #2a76c6;
background-color: rgba(42, 118, 198, 0.2); }
Create a file called ~/.config/gtk-3.0/gtk.css
with the following content:
#gscan2pdf-ocr-output {
color: black;
}
XSane (http://xsane.org/)
Scan Tailor (http://scantailor.org/)
Jeffrey Ratcliffe (jffry at posteo dot net)
all the people who have sent patches, translations, bugs and feedback.
the gtk+ project for a most excellent graphics toolkit.
the Gtk3-Perl project for their superb Perl bindings for GTK3.
The SANE project for scanner access
Björn Lindqvist for the gtkimageview widget
Sourceforge for hosting the project.
Copyright (C) 2006-2024 Jeffrey Ratcliffe <jffry@posteo.net>
This program is free software: you can redistribute it and/or modify it under the terms of the version 3 GNU General Public License as published by the Free Software Foundation.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see <https://www.gnu.org/licenses/>.