German Documents in Russia - Bulk Download

Hoplophile · #1

What follows is a description of a technique that I have used to successfully download, in bulk, the many JPEG images that make up each of the multi-page digital documents posted on the German Documents in Russia website.

http://www.germandocsinrussia.org/

I used this technique on a MacBook Pro computer running OS 10.10.5 (Yosemite) and the current (as of 18 December 2016) version of Google's "Chrome" browser.

I began by adding an extension to Chrome called "Chrono Download." I found this extension on the Chrome extensions section of the Google webstore. https://chrome.google.com/webstore/category/extensions

Once I had installed Chrono Download, I opened it, and got a page with the following toolbar along its top edge.

[Image: Chrono Download Toolbar.png]

I clicked on the icon that looked like a pair of gear wheels. This took me to the "options" page.

On the options page, I set "concurrent downloads" to "1." (This ensured that each image finished downloading before the next in the series began. This, in turn, enabled me to use the time of download as a proxy for page number.)

[Image: Concurrent Downloads.png]

I then returned to the main page of the extension, and clicked the green cross icon. This brought forth a box marked "New Task."

I then went to the German Documents in Russia website and selected the document I wished to download. (In this case, it was a war diary for a siege mortar battery from the First World War.)

http://tsamo.germandocsinrussia.org/de/ ... rid/zoom/1

I clicked the thumbnail for the first of the 243 images in the set.

http://tsamo.germandocsinrussia.org/de/ ... ect/zoom/4

I clicked the "printer" icon to bring forth the "printable" version of the image.

http://tsamo.germandocsinrussia.org/pages/33251/zooms/8

I made a note of the number that appeared between the words "pages" and "zooms" on the URL. I also made sure to copy the URL of the "printable" version of the image.

I added the number of pages in the document (243) to the number between "pages" and "zooms" (33251). This yielded 33494.

I subtracted one (1) from this number, getting 33493.

(Another way to get the second number is to make a printable version of the last image of the document and look at the serial number that appears between "pages" and "zooms.")
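The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name last_serial is made up for illustration, and the calculation assumes that the serial numbers of the document run without gaps.

```python
def last_serial(first_serial: int, page_count: int) -> int:
    """Serial number of the last image, assuming a gap-free series."""
    # The first image already counts as page 1, hence the "minus one".
    return first_serial + page_count - 1


# Numbers taken from the war diary described above.
print(last_serial(33251, 243))  # 33493
```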

I returned to the Chrono Download extension and pasted the URL of the "printable" version of the image in the section of the "New Task" box marked "URL."

I modified this URL by replacing the original number between "pages" and "zooms" (33251) with the following expression: [33251:33493]

(The first number in the expression was the serial number of the first image in the document. The second number was the serial number of the last image in the document.)

The modified URL thus looked like this:

http://tsamo.germandocsinrussia.org/pages/[33251:33493]/zooms/8
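For readers who want to see what a range expression of this kind stands for, the following Python sketch expands it into the individual URLs. It is an imitation written for illustration, not a description of how Chrono Download works internally; the function name expand_range_url is hypothetical.

```python
import re


def expand_range_url(template: str) -> list:
    """Expand one [first:last] expression in a URL into a list of URLs."""
    match = re.search(r"\[(\d+):(\d+)\]", template)
    if match is None:
        return [template]  # nothing to expand
    first, last = int(match.group(1)), int(match.group(2))
    # Count upwards or downwards, depending on how the range was written.
    step = 1 if first <= last else -1
    head, tail = template[:match.start()], template[match.end():]
    return [f"{head}{n}{tail}" for n in range(first, last + step, step)]


urls = expand_range_url(
    "http://tsamo.germandocsinrussia.org/pages/[33251:33493]/zooms/8"
)
print(len(urls))   # 243
print(urls[0])     # http://tsamo.germandocsinrussia.org/pages/33251/zooms/8
print(urls[-1])    # http://tsamo.germandocsinrussia.org/pages/33493/zooms/8
```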

I then hit start, and watched as the extension downloaded each of the images.

A few seconds later, I checked my downloads folder and found that the images were all there, but in reverse chronological order.

I transferred the files from my downloads folder to a folder of their own.

Within that folder, I ensured that the image files were in chronological order. (I did that by clicking on "date modified" to reverse the order in which they appeared.) That done, I used the "rename" feature to provide each with a page number that correlated with its place in the document.

(For some reason that I have yet to fathom, the first image of the document - the cover page - ended up at the end of the series. I remedied this by renaming it with a number that put it at the front of the series.)
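The sorting and renaming step could also be scripted. The sketch below is a rough Python equivalent of what was done by hand above, under several assumptions: the images have already been moved to a folder of their own, they have a .jpg or .jpeg extension, and their modification times reflect the order of download. The function name and the "page" prefix are hypothetical. It does nothing about the misplaced cover page, which would still need to be renamed by hand.

```python
from pathlib import Path


def rename_by_download_time(folder: str, prefix: str = "page") -> list:
    """Rename JPEG files to prefix_001.jpg, prefix_002.jpg, ... by age."""
    images = [
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in {".jpg", ".jpeg"}
    ]
    # Oldest file first, so that download order becomes page order.
    images.sort(key=lambda p: p.stat().st_mtime)
    # Zero-pad the numbers so that the files also sort correctly by name.
    width = max(3, len(str(len(images))))
    renamed = []
    for number, path in enumerate(images, start=1):
        target = path.with_name(f"{prefix}_{number:0{width}d}{path.suffix}")
        path.rename(target)
        renamed.append(target)
    return renamed


# Example call (the folder name is an assumption):
# rename_by_download_time("war_diary")
```

It would be prudent to try this on a copy of the folder first, since the renaming cannot be undone by the script.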

Please note that I have not tried this technique with other browsers, operating systems, or websites.

Please note that this technique is the product of speculation, trial, and error on the part of someone whose last experience of coding was in 1984 (in Basic!) Thus, I cannot explain how this works.

Hoplophile · #2

I have just discovered that this technique does not work for every document in this collection. To be more specific, I have just run into a document in which there are irregular gaps between the serial numbers. That is, rather than being 2697, 2698, 2699 ... the series is 2697, 2701, 2705, 2709, 2714, 2720 ...

These gaps seem to be a function of the way that the images were scanned. That is, the technician seems to have scanned the covers of four related documents before moving on to the second pages of those documents and so forth.

So, before using the technique described in the previous post, I recommend checking the serial number of the printable version of the last image in the series. If this is equal to the number of images in the document plus the serial number of the printable version of the first image minus one, then the aforementioned technique should work. If, however, the number of the last image is larger, then there are gaps that will complicate the downloading to the point where the use of this technique offers little or no advantage over the downloading of each image by hand.
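The check described above amounts to a single comparison. Here it is as a small Python sketch; the function name is_gap_free is hypothetical, and the second example simply treats the six serial numbers quoted earlier in this post as if they were a complete six-image document.

```python
def is_gap_free(first_serial: int, last_serial: int, image_count: int) -> bool:
    """True if the serial numbers can run from first to last without gaps."""
    return last_serial == first_serial + image_count - 1


# The war diary from the first post: 243 images, 33251 to 33493.
print(is_gap_free(33251, 33493, 243))  # True

# Six images numbered 2697, 2701, 2705, 2709, 2714, 2720.
print(is_gap_free(2697, 2720, 6))      # False
```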

Hoplophile · #3

The good news is that I continue to enjoy success with the bulk download technique described in the first post of this thread. That is, I continue to find documents in which there are no gaps between the serial numbers of images.

Another bit of good news is that this technique can be used to download the images of several documents at once. This is of use in those cases where the documents in question were scanned in groups of three, four, or five. To be more specific, I used this technique to download the images associated with a set of four notebooks, each of which recorded the consumption of artillery ammunition on a given day of the battle of Verdun.

http://tsamo.germandocsinrussia.org/de/ ... rid/zoom/1

http://tsamo.germandocsinrussia.org/de/ ... rid/zoom/1

http://tsamo.germandocsinrussia.org/de/ ... rid/zoom/1

http://tsamo.germandocsinrussia.org/de/ ... rid/zoom/1

The images made from the pages of these notebooks were scanned according to this pattern: cover of first notebook, cover of second notebook, cover of third notebook, cover of fourth notebook, first page of first notebook, first page of second notebook, and so forth. Unfortunately, this pattern is not entirely regular. Thus, there is no point in creating a URL that exploits regular intervals to link only to every fourth image in a series.

(I am sorry to report that the first volume of this set is missing from the on-line collection.)

Hoplophile · #4

When creating the expression that describes the range of images to download, I found that it is better to put the higher serial number in the first position and the lower in the second position. This leads to a situation in which the images are downloaded in their normal order. That is, the order of downloading follows the sequence of the serial numbers.

For example, rather than using the expression [28768:28938] in the URL, use the expression [28938:28768].

Unfortunately this does not solve the problem of the 'misplaced first image'. Rather, it is now the last image that is misplaced. That is, it will be downloaded first.

Hoplophile · #5

Yesterday, I ran into a document in which some of the images were in a gap-free sequence (and thus suitable for bulk downloading) while others were in a sequence with large, irregular gaps. The adventure continues!

Richard Hedrick · #6

Hello,

Thanks for posting this technique; I found it useful. I had made attempts early on when the site first started, but then they made changes that made things more difficult, so I lost interest. The gaps are an issue since, as you said, they are irregular. If you figure anything out, let us know. I also wonder about the file names of the JPGs. Are they completely random or the result of some algorithm?

At any rate, much better than one at a time so thanks again.

cheers,
Richard

Hoplophile · #7

Hello, Richard (if I may),

Before I try to answer your question of the file names of the images that make up each of the documents made available on "German Documents in Russia", let me thank you for all of the fine work that you do on your website. To use a phrase traditional in the American Sea Services: Bravo Zulu!

Now for the numbers ...

I just did a little experiment with the file names associated with an eight-page document from "German Documents in Russia."

Here is a link to the document in question. (It's a map problem dating from February 1939.)

http://wwii.germandocsinrussia.org/de/n ... rid/zoom/1

Here is a screenshot that shows the eight images that make up the document.

Here is a table that shows the two designators for each of the images in the document. The first (called "download" in the table) is a forty-character name for each of the images downloaded. The second (called "print") is a four-digit serial number from the URL that appears when I put each image in "print mode."

As you can see, while the four-digit serial numbers form a regular series, the forty-character image names follow no easily discernible pattern.

By the way, I've repeated this download three times. In each of those instances, the forty-character image names remained the same. So, I think it is reasonable to infer that, rather than being assigned in the downloading process, these image names precede that process.

Cheers,

Bruce

Richard Hedrick · #8

Hey Bruce,

Thanks for the reply and the additional info. I came to the same conclusion that each image has a very specific file name. What I am wondering is whether, when the file names were originally created, they were generated randomly or based on an algorithm that takes into account the document group (12451), the document (418) and the page/image number. The idea would be that you could download the full range of images but then run the file names through a decrypting process that would change the 40-character string to something more readable that would also allow each document to sort together.

So below, the first three lines are what a sortable file name would look like, representing group number, document number, and page number. The next three lines are the same strings encrypted using Rijndael-256. These are encoded in Base64, whereas it appears that their file names are encoded in hexadecimal, since the base character set seems to be hex. So basically you are left with figuring out what algorithm and key they used and possibly several other factors.

12451-0418-0001
12451-0418-0002
12451-0418-0003

KBTYKxQraHCa0WDinmZLD9HbHbchg7NTaIF3woyl3ts=
aQP4ujY0MI8jV/crT43hFxmjbbcwcDon+ZXrE4TNqPc=
L4Q91HDl5v0GHBENxxpHmp7nLEufBHzmBMtcdMo3y+I=

At any rate, not that it would likely be worth the effort but I am not smart enough to figure this out anyways. It just seems an interesting problem to solve.
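A quick length check may help narrow the problem down. The Python sketch below decodes the three Base64 strings above and reports how long each would be if written in hexadecimal. It says nothing about which algorithm the site actually uses; it only compares lengths.

```python
import base64

samples = [
    "KBTYKxQraHCa0WDinmZLD9HbHbchg7NTaIF3woyl3ts=",
    "aQP4ujY0MI8jV/crT43hFxmjbbcwcDon+ZXrE4TNqPc=",
    "L4Q91HDl5v0GHBENxxpHmp7nLEufBHzmBMtcdMo3y+I=",
]

for text in samples:
    raw = base64.b64decode(text)
    # Each byte becomes two hexadecimal characters.
    print(len(raw), "bytes,", len(raw.hex()), "hex characters")
```

Each sample decodes to 32 bytes, which would take 64 hexadecimal characters to write out. A forty-character hexadecimal name, by contrast, holds only 20 bytes (160 bits). That happens to be the length of a SHA-1 digest, among other things, though the length alone does not show how the names were produced.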

Another issue is the resolution that can be downloaded. The level 8 resolution that you show in your process is the best I have been able to download as well. There is also a level 9 (can be seen using the zoom view) that provides a significant improvement in resolution, which is really nice for maps and such. However, when you try to download/print in this resolution, it takes you to a login page. Apparently you have to be an authenticated user in order to print at that resolution.

Also, for anyone who is not aware, many of these documents can be downloaded from John Calvin’s FTP. I am not sure at what resolution they were downloaded, but it is not as good as level 8.

Cheers,
Richard

Hoplophile · #9

This is intriguing. I suspect, however, that you are right about the amount of time needed to learn enough about encryption to decode the file names. Downloading the trickier documents page-by-page is a chore, but at least it is a mindless one.

Still, I wonder why there are so many numbers in the file names. If the characters were assigned randomly, I would expect a larger number of letters and a smaller number of numbers.

GregSingh · #10

If the characters were assigned randomly, I would expect a larger number of letters and a smaller number of numbers.

Not with the Hex numerical system?
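To put a number on this: hexadecimal uses sixteen characters, of which ten are numerals (0 to 9) and only six are letters (a to f). The Python sketch below works out the expected share of numerals and measures it in one randomly generated forty-character hexadecimal string. The random string is purely illustrative and is not one of the site's file names.

```python
import secrets

HEX_CHARACTERS = "0123456789abcdef"

# Share of numerals one would expect in a long random hex string.
expected_share = sum(c.isdigit() for c in HEX_CHARACTERS) / len(HEX_CHARACTERS)


def digit_share(name: str) -> float:
    """Fraction of the characters in a name that are numerals."""
    return sum(c.isdigit() for c in name) / len(name)


sample = secrets.token_hex(20)  # 20 random bytes = 40 hex characters
print(expected_share)           # 0.625
print(sample, digit_share(sample))
```

So, in a randomly assigned hexadecimal name, numerals should outnumber letters by roughly five to three.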

Der Alte Fritz · #11

Do you think that this technique would work with the Pamyat-Naroda site?

Also has anyone seen an index to the German Documents site?

Axis History Forum
