PhotoRec Data Carving
Data carving is the process of extracting a collection of data from a larger data set. Data carving is frequently used during a digital investigation, when the unallocated file system space is analyzed to extract files. The files are "carved" from the unallocated space using file type specific header and footer values. File system structures are not used during the process. This is exactly how PhotoRec works.
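As an illustration of header/footer carving, here is a minimal Python sketch. It is not PhotoRec's implementation: the function name `carve_jpeg_candidates` is hypothetical, and the sketch assumes that files start on 512-byte sector boundaries and that the first footer found ends the file.

```python
SECTOR_SIZE = 512
JPEG_HEADER = b"\xff\xd8\xff"  # SOI marker followed by the first byte of the next marker
JPEG_FOOTER = b"\xff\xd9"      # EOI marker

def carve_jpeg_candidates(image: bytes, sector_size: int = SECTOR_SIZE):
    """Return (start_offset, end_offset) pairs for naive JPEG candidates.

    A candidate starts at a sector boundary that holds the JPEG header and
    ends right after the first footer found behind it. No file system
    structure is consulted.
    """
    candidates = []
    for start in range(0, len(image), sector_size):
        if image[start:start + len(JPEG_HEADER)] != JPEG_HEADER:
            continue
        footer_at = image.find(JPEG_FOOTER, start + len(JPEG_HEADER))
        if footer_at != -1:
            candidates.append((start, footer_at + len(JPEG_FOOTER)))
    return candidates

# One empty sector, then a fake "JPEG" made of a header, filler and a footer.
demo = bytes(512) + b"\xff\xd8\xff\xe0" + b"A" * 700 + b"\xff\xd9" + bytes(318)
print(carve_jpeg_candidates(demo))  # [(512, 1218)]
```

Stopping at the first footer is the weak point of this sketch: a JPEG with an embedded thumbnail contains more than one end marker, and a fragmented file has foreign data between header and footer. That is why the recovered data still has to be validated.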
The Digital Forensics Research Workshop (DFRWS) issued a data carving challenge. The data set for this challenge is a 50MB raw file. It has no file system, but it contains JPEG, ZIP, HTML, Text, and Microsoft Office files and fragments. The goal is to extract as many complete JPEG, ZIP, HTML, Text, and Office files as possible from it. Using this challenge as a test bed, PhotoRec has been improved to recover even more data than before.
Everyone is welcome to contribute to the project.
Data Recovery Process
PhotoRec
The first step was to run PhotoRec, version 6.5-WIP (WIP = Work In Progress). PhotoRec scanned the image file for known headers and successfully recognised all JPEG, OLE/Office, HTML and ZIP headers, with no false positives.
The JPEG footer, used to determine the file size and validity of a recovered JPEG, is checked by PhotoRec using libjpeg. ZIP footers are detected, but the file integrity isn't checked. The OLE file format is very complex: its internals resemble a file system, but PhotoRec is able to get the file size by analyzing the FAT. Text files are hard to detect because they have no header. After a UTF-8 to ASCII translation, PhotoRec calculates the index of coincidence to determine whether a sector holds text or random data. There can be false positives if DOC or HTML files aren't well detected (i.e. when their data is fragmented).
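The index of coincidence test can be sketched as follows. This is only an illustration of the statistic: the threshold of 0.03 and the function names are assumptions, not values or names taken from PhotoRec, and the UTF-8 to ASCII translation step is left out.

```python
from collections import Counter

def index_of_coincidence(data: bytes) -> float:
    """Probability that two bytes drawn at random from data are equal."""
    n = len(data)
    if n < 2:
        return 0.0
    counts = Counter(data)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))

def looks_like_text(sector: bytes, threshold: float = 0.03) -> bool:
    """Heuristic: natural-language text reuses a few byte values often.

    The threshold is an illustrative assumption, not PhotoRec's value.
    """
    return index_of_coincidence(sector) >= threshold

text_sector = b"the quick brown fox jumps over the lazy dog " * 12
flat_sector = bytes(range(256)) * 2  # every byte value used equally often

print(looks_like_text(text_sector))  # True
print(looks_like_text(flat_sector))  # False
```

Uniformly distributed bytes give an index close to 1/256, while English text sits well above that. A sector filled with a single repeated byte also scores high, so a real detector needs more checks than this one statistic.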
Manual recovery of remaining JPEG
PhotoRec can handle some forms of data fragmentation in JPEG files. Using the libjpeg library, it is able to check recovered data. This way it recovered 9 JPEGs perfectly. Manual recovery was then started for the remaining files: using dd and PhotoRec, additional files have been recovered.
A picture of a hedgehog begins at sector 31475 and a picture from Mars begins at sector 31533. Extract from photorec.log:
31475-31532: jpg
31533-32836: jpg
The second picture begins while the first isn't finished - both pictures are corrupted.
Reading the photorec log file, we learn that the Mars picture is corrupted after about 118784 bytes (JPG error at offset 118784). Let's try to find the exact data fragment size.
$ dd if=dfrws-2006-challenge.raw of=mars1.jpg skip=31533 count=`expr 118784 / 512`
232+0 records in
232+0 records out
$ display mars1.jpg
display: Corrupt JPEG data: premature end of data segment `mars1.jpg'.
display: Corrupt JPEG data: premature end of data segment `mars1.jpg'.
The JPEG fragment is 232 sectors long but garbage can be seen at the end of the image - it means the fragment is too large.
By trial and error, it's possible to determine that the fragment is 220 sectors long.
$ dd if=dfrws-2006-challenge.raw of=mars2.jpg skip=31533 count=220
220+0 records in
220+0 records out
$ display mars2.jpg
display: Corrupt JPEG data: premature end of data segment `mars2.jpg'.
display: Corrupt JPEG data: premature end of data segment `mars2.jpg'.
There is no garbage left in the picture.
31475-31532: jpg fragment, hedgehog
31533-31752: jpg fragment, mars
31753-32836 ?
$ dd if=dfrws-2006-challenge.raw skip=31475 count=`expr 31532 - 31475 + 1` > hedgehog.jpg
58+0 records in
58+0 records out
$ dd if=dfrws-2006-challenge.raw skip=31753 count=`expr 32836 - 31753 + 1` >> hedgehog.jpg
1084+0 records in
1084+0 records out
$ display hedgehog.jpg
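The two dd calls concatenate inclusive sector ranges into one file. A minimal Python equivalent is sketched below. The helper names `read_sectors` and `join_fragments` are hypothetical, and 512-byte sectors are assumed, as in the dd commands.

```python
SECTOR_SIZE = 512

def read_sectors(image_path, first, last, sector_size=SECTOR_SIZE):
    """Return the bytes of sectors first..last (both inclusive)."""
    with open(image_path, "rb") as image:
        image.seek(first * sector_size)
        return image.read((last - first + 1) * sector_size)

def join_fragments(image_path, ranges, output_path, sector_size=SECTOR_SIZE):
    """Concatenate inclusive sector ranges into output_path.

    Returns the number of bytes written.
    """
    written = 0
    with open(output_path, "wb") as out:
        for first, last in ranges:
            written += out.write(read_sectors(image_path, first, last, sector_size))
    return written

# Usage mirroring the dd commands (requires the challenge image):
# join_fragments("dfrws-2006-challenge.raw",
#                [(31475, 31532), (31753, 32836)],
#                "hedgehog.jpg")
```

With the ranges above, 58 + 1084 = 1142 sectors are written, that is 584704 bytes.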
Now, the exact file size can be found using PhotoRec on the recovered picture.
$ photorec hedgehog.jpg
PhotoRec 6.4, Data Recovery Utility, June 2006
Christophe GRENIER <grenier@cgsecurity.org>
https://www.cgsecurity.org
Please wait...
Disk hedgehog.jpg - 584 KB / 571 KiB - CHS 1 255 63 (RO), sector size=512
PhotoRec exited normally.
$ ls -l recup_dir.1/f0.jpg
-rw-rw-r-- 1 kmaster kmaster 98354 Jul 10 11:30 recup_dir.1/f0.jpg
$ md5sum recup_dir.1/f0.jpg
db89684c177168036e274140ecf766a1 recup_dir.1/f0.jpg
The picture size is 98354 bytes (193 sectors). We can now recover the Mars picture.
$ expr 31753 + 193 - 58
31888
$ dd if=dfrws-2006-challenge.raw skip=31533 count=220 > mars3.jpg
220+0 records in
220+0 records out
$ dd if=dfrws-2006-challenge.raw skip=31888 count=`expr 32836 - 31888 + 1` >> mars3.jpg
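The sector arithmetic behind this step can be spelled out as a short worked example. The numbers come from the text above. The helper `sectors_for` is a hypothetical name.

```python
SECTOR_SIZE = 512

def sectors_for(size_in_bytes, sector_size=SECTOR_SIZE):
    """Number of sectors needed to hold size_in_bytes (rounded up)."""
    return -(-size_in_bytes // sector_size)

hedgehog_sectors = sectors_for(98354)                # whole hedgehog picture
first_fragment = 31532 - 31475 + 1                   # sectors 31475-31532
second_fragment = hedgehog_sectors - first_fragment  # remainder after the Mars fragment
mars_resume = 31753 + second_fragment                # first sector after the hedgehog

print(hedgehog_sectors, first_fragment, second_fragment, mars_resume)
# 193 58 135 31888
```

The hedgehog therefore occupies sectors 31475-31532 and 31753-31887, and the second Mars fragment starts at sector 31888.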
As seen before, it's possible to get the exact file size:
$ photorec mars3.jpg
PhotoRec 6.4, Data Recovery Utility, June 2006
Christophe GRENIER <grenier@cgsecurity.org>
https://www.cgsecurity.org
Please wait...
Disk mars3.jpg - 598 KB / 584 KiB - CHS 1 255 63 (RO), sector size=512
PhotoRec exited normally.
$ ls -l recup_dir.2/
total 192
-rw-rw-r-- 1 kmaster kmaster 188693 Jul 10 11:47 f0.jpg
$ md5sum recup_dir.2/f0.jpg
0915313e99af0f6bf13bc06bcd003113 recup_dir.2/f0.jpg
Manual recovery of zip files
Three zip files are recovered by PhotoRec but one of them is corrupted. A small Perl script was used to fix the zip file beginning at sector 45015 found by PhotoRec. Using unzip, this little Perl script locates and removes the extra sectors present in the file.
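The Perl script itself is not reproduced here. The following Python sketch shows one way such a repair could work, under stated assumptions: the extra data is a single contiguous run of whole 512-byte sectors, and Python's `zipfile` module stands in for the unzip integrity test. The function names and the `max_extra` limit are hypothetical.

```python
import io
import zipfile

SECTOR_SIZE = 512

def zip_is_intact(data: bytes) -> bool:
    """True if every archive member can be read and matches its CRC."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            return archive.testzip() is None
    except Exception:  # BadZipFile, zlib.error, ...: treated as "not intact"
        return False

def remove_extra_sectors(data: bytes, max_extra: int = 8,
                         sector_size: int = SECTOR_SIZE):
    """Brute-force search for one run of foreign sectors inside a ZIP file.

    Tries to drop every run of 1..max_extra sectors and returns
    (start_sector, sector_count, repaired_bytes) for the first candidate
    that passes the integrity test, or None if no candidate does.
    """
    total = -(-len(data) // sector_size)
    for count in range(1, max_extra + 1):
        for start in range(total - count + 1):
            candidate = (data[:start * sector_size]
                         + data[(start + count) * sector_size:])
            if zip_is_intact(candidate):
                return start, count, candidate
    return None
```

The search costs one integrity test per candidate, which stays manageable for a file of a few hundred sectors. It cannot repair a file whose foreign data is split into several runs.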
Manual recovery of XLS/Ole file
Office documents, including Excel workbooks, use the OLE file format. A document has been identified at sector 2051, but PhotoRec did not recover it successfully (the file may be fragmented). An OLE file consists of a header structure followed by a sequence of sectors. In our case,
- the size of the sectors is 512 bytes.
- the SID (Sector Identifier) of the first sector of the directory stream is 1688.
- the master sector allocation table uses 14 sectors: 1673-1685,1689.
The directory stream, located at sector 3761 (SID 1688), lists the following components:
Workbook SID 0 size 848333
SummaryInformation SID 1657 size 4096
DocumentSummaryInformation SID 1690 size 4096
Sectors | Object | SID |
2051 | Header | N/A |
2052-?,?-3729 | Workbook | 0-1656 |
x-x+20 | 21 extra sectors, not XLS | N/A |
3730-3737 | SummaryInformation | 1657-1664 |
3746-3758,3762 | Allocation Table | 1673-1685,1689 |
3761 | RootDirectory | 1688 |
3763-3770 | DocumentSummaryInformation | 1690-1697 |
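The table can be cross-checked with a little arithmetic. The sketch below assumes that each stream is stored contiguously in SID order (a real OLE reader follows the allocation table chain instead) and that the header occupies one 512-byte sector, so SID 0 is the sector right after it. The function names are hypothetical.

```python
SECTOR_SIZE = 512
HEADER_SECTOR = 2051  # image sector holding the OLE header
EXTRA_SECTORS = 21    # foreign sectors somewhere inside the Workbook stream

def stream_sids(first_sid, size, sector_size=SECTOR_SIZE):
    """Inclusive SID range of a stream, assuming it is stored contiguously."""
    count = -(-size // sector_size)  # round up to whole sectors
    return first_sid, first_sid + count - 1

def sid_to_image_sector(sid, after_gap=True):
    """Image sector of a SID; SIDs behind the foreign data are shifted by 21."""
    return HEADER_SECTOR + 1 + sid + (EXTRA_SECTORS if after_gap else 0)

print(stream_sids(0, 848333))     # (0, 1656)     Workbook
print(stream_sids(1657, 4096))    # (1657, 1664)  SummaryInformation
print(sid_to_image_sector(1657))  # 3730
print(sid_to_image_sector(1688))  # 3761          RootDirectory
```

Every object after the Workbook stream is found exactly 21 sectors later than its SID predicts, which is where the count of 21 extra sectors comes from.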
Unfortunately, I have failed to locate the 21 extra sectors. Even so, the latest version of OpenOffice has been able to open the corrupted file and display most of the data. A new version of the document can be found on the FCC web site.
Disk Layout
Sectors | File type | Note |
0-8 | HTML (fragment) | Alice in Wonderland by Lewis Carroll |
9-44 | HTML | Alice in Wonderland by Lewis Carroll |
2051 | Office (fragment) | Excel, http://www.fcc.gov/Forms/Form477/477.xls |
3868-4428 | JPG | 640x481 Mars |
4436-4455 | HTML (end is missing) | A STUDY IN SCARLET by Sir Arthur Conan Doyle |
4456-4501 | HTML | Stave 1: Marley's Ghost by Charles Dickens |
4502-4556 | HTML (beginning is missing) | Stave 1: Marley's Ghost by Charles Dickens |
7964-8284 | Office | Upcoming Research Symposium 1/3 |
8285-9473 | JPG | www.dfrws.org/2004/photos/day2/rodeo1-3-dfrws2004.jpg |
9474 | Office | Upcoming Research Symposium 2/3 |
10031 | Office | Upcoming Research Symposium 3/3 |
11619-11822 | JPG | yeast 1/2 |
11823-11848 | Text | Moby Dick, Chapter i - LOOMINGS (page 1-6) |
11849-12017 | JPG | yeast 2/2 |
12222-26116 | JPG | DFRWS 2006 Forensics Challenge, 11598x11598 |
27496-27606 | HTML | The Comedy of Errors by Shakespeare, Act I, Scene I (1/2) |
27607-27977 | JPG | The porcupine |
27978-28196 | HTML | The Comedy of Errors by Shakespeare (2/2) |
28244-28245 | HTML | Moby Dick - chapter 134 (1/2) |
28246-28306 | Text (fragment) | De la division du travail social, Emile Durkheim |
28307-28344 | HTML | Moby Dick page - chapter 135 (2/2) |
28439-28726 | ZIP | Zip Ok |
28729-29528 | ZIP | ZIP 1/2 |
29529-29895 | HTML | The Tempest, Shakespeare |
29896-31368 | ZIP | ZIP 2/2 |
31475-31532 | JPG | A hedgehog (1/2) |
31533-31752 | JPG | Mars (1/2) |
31753-31887 | JPG | A hedgehog (2/2) |
31888-32036 | JPG | Mars (2/2) |
32837-33397 | Office | www.tsa.gov/public/interweb/assetlibrary/Permitted_Prohibited_Facts.doc |
34288-34306 | Office | "Reports on Computer Systems Technology" http://csrc.nist.gov/publications/nistpubs/800-26/sp800-26.doc 1/2 |
34307-34412 | Text | The Adventure of the Copper Beeches |
34413-36236 | Office | http://csrc.nist.gov/publications/nistpubs/800-26/sp800-26.doc 2/2 |
36292-36640 | JPG | ? |
36998-37649 | Office | PREVENTING CRIME: WHAT WORKS, WHAT DOESN'T, WHAT'S PROMISING www.ncjrs.org/docfiles/wholedoc.doc 1/3 |
37727-39427 | Office | www.ncjrs.org/docfiles/wholedoc.doc 2/3 |
39477-40380 | Office | www.ncjrs.org/docfiles/wholedoc.doc 3/3 |
40638-41219 | JPG | www.dfrws.org/2004/photos/day2/rodeo1-breaf-dfrws2004.jpg 1/2 |
41239-41609 | JPG | www.dfrws.org/2004/photos/day2/rodeo1-breaf-dfrws2004.jpg 2/2 |
41611-43433 | JPG | http://imgsrc.hubblesite.org/hu/db/2006/10/images/a/formats/1280_wallpaper.jpg (1/2) |
43434-44028 | JPG | www.dfrws.org/2004/photos/day2/rodeo1-dfrws2004.jpg |
44029-44200 | JPG | http://imgsrc.hubblesite.org/hu/db/2006/10/images/a/formats/1280_wallpaper.jpg (2/2) |
45015-45386 | ZIP | Zip 1/2 |
45390-45545 | ZIP | Zip 2/2 |
45566-45963 | JPG | U. S. Geological Survey Open-File Report 01-154, Slope off Florida Keys https://pubs.usgs.gov/of/2001/of01-154/data/bphotos/1565.jpg 1/2 |
45964-46103 | Office | Farm Credit System Insurance Corporation, Statement of Financial Condition, March 31, 2006 and December 31, 2005 |
46104-46826 | JPG | https://pubs.usgs.gov/of/2001/of01-154/data/bphotos/1565.jpg 2/2 |
46910-94836 | JPG | DFRWS 2006 Forensics Challenge, 8640x8640 |
94846-95628 | JPG | Saturn http://imgsrc.hubblesite.org/hu/db/2001/15/images/a/formats/full_jpg.jpg (1/2) |
95630-96653 | JPG | Saturn http://imgsrc.hubblesite.org/hu/db/2001/15/images/a/formats/full_jpg.jpg (2/2) |
Files
File type | File size (in bytes) | MD5 hash | Sectors | Note | PhotoRec Score |
HTML (fragment) | 4608 | ec89111e45da8265b641655d0f68725e | 0-8 | Alice in Wonderland by Lewis Carroll | 5 |
HTML | 18147 | eec87931b03e5a4a4ef8fd51109a1227 | 9-44 | Alice in Wonderland by Lewis Carroll | 5 |
Office | 869888 | ? | ~ 2051-3770 (21 extra sectors) | http://www.fcc.gov/Forms/Form477/477.xls | 1 |
JPG | 287186 | daf4205574abd6919b10ca8be92d17a3 | 3868-4428 | 640x481 Mars | 5 |
HTML (end is missing) | 10240 | 799ad2d2f2f1f17657338d98c97559c4 | 4436-4455 | A STUDY IN SCARLET by Sir Arthur Conan Doyle | 5 |
HTML | 23544 | f4481ed348d3d59c5dad80afeb0341f9 | 4456-4501 | Stave 1: Marley's Ghost by Charles Dickens | 5 |
HTML (beginning is missing) | 27875 | baf8b811ee9502408f9f0e73efa77cf0 | 4502-4556 | Stave 1: Marley's Ghost by Charles Dickens | 5 |
Office | 450048 | 8d2a9a284e078805ada47db191f35244 | 7964-8284, 9474-10031 | Upcoming Research Symposium | 5 |
JPG | 608703 | 4efc6c572683878efd8f3404ddaded7b | 8285-9473 | www.dfrws.org/2004/photos/day2/rodeo1-3-dfrws2004.jpg | 5 |
JPG | 190720 | 7b07320709e0caa947663f5df3a0a390 | 11619-11822, 11849-12017 | yeast | 5 |
Text | 12826 | f800a46e18fafd309825c5ee84a654a2 | 11823-11848 | Moby Dick, Chapter i - LOOMINGS (page 1-6) | 3 |
JPG | 7113968 | b070beae1606f67a342bc5f78c29c743 | 12222-26116 | DFRWS 2006 Forensics Challenge, 11598x11598 | 5 |
HTML | 168525 | 1959aa0391664b60fd0f2e64ed7a22f4 | 27496-27606, 27978-28196 | The Comedy of Errors by Shakespeare, Act I, Scene I | 2 |
JPG | 189534 | fe7e7ac67709f2d9c2483aa98c681b99 | 27607-27977 | The porcupine | 5 |
HTML | 20019 | 045798407b927321326a547704e67831 | 28244-28245, 28307-28344 | Moby Dick - chapter 134 and 135 | 2 |
Text (fragment) | 30816 | 616a6bbe915c3dbf51014fd76f55b0e3 | 28246-28306 | De la division du travail social, Emile Durkheim | 0 |
ZIP | 147150 | ebabde39ba44d38888dd82606980498a | 28439-28726 | Zip Ok | 5 |
ZIP | 1163745 | 9a4c2d3a9bd203eb39c9f954a3c997e4 | 28729-29528, 29896-31368 | ZIP | 5 |
HTML | 187793 | 158496c522d97b7389c9907cae777ac1 | 29529-29895 | The Tempest, Shakespeare | 5 |
JPG | 98354 | db89684c177168036e274140ecf766a1 | 31475-31532, 31753-31887 | A hedgehog | 2 |
JPG | 188693 | 0915313e99af0f6bf13bc06bcd003113 | 31533-31752, 31888-32036 | Mars | 2 |
Office | 287232 | 0e52e75029e99cd2e9dcd0af271cf4a2 | 32837-33397 | www.tsa.gov/public/interweb/assetlibrary/Permitted_Prohibited_Facts.doc | 5 |
Office | 943616 | d7ff92b8cc1c89c46a78288b9c673152 | 34288-34306, 34413-36236 | http://csrc.nist.gov/publications/nistpubs/800-26/sp800-26.doc | 2 |
Text | 53870 | 5a12ef9dba88a186ef18a5d349b28e37 | 34307-34412 | The Adventure of the Copper Beeches | 3 |
JPG | 178659 | 2fae8770cc013d22e9ea1c070f2f509b | 36292-36640 | ? | 5 |
Office | 1667584 | 4a22f04b097920d11fff4e192e0667a4 | 36998-37649, 37727-39427, 39477-40380 | PREVENTING CRIME: WHAT WORKS, WHAT DOESN'T, WHAT'S PROMISING www.ncjrs.org/docfiles/wholedoc.doc | 2 |
JPG | 487473 | f8c51e0688796b5d616f0e5d4a94d104 | 40638-41219, 41239-41609 | www.dfrws.org/2004/photos/day2/rodeo1-breaf-dfrws2004.jpg | 2 |
JPG | 1021085 | 7cce072e518fd72484c97adb1b4be08e | 41611-43433, 44029-44200 | http://imgsrc.hubblesite.org/hu/db/2006/10/images/a/formats/1280_wallpaper.jpg | 5 |
JPG | 304413 | c0da37b3f1a07af790e6e9171cedc4d2 | 43434-44028 | www.dfrws.org/2004/photos/day2/rodeo1-dfrws2004.jpg | 5 |
ZIP | 270181 | f940fcc37c82e8ff1431e5c3c061611e | 45015-45386, 45390-45545 | Zip | 2 |
JPG | 573499 | 2320fe9c41eaddb864a56c2ddc4dd186 | 45566-45963, 46104-46826 | U. S. Geological Survey Open-File Report 01-154, Slope off Florida Keys http://pubs.usgs.gov/of/2001/of01-154/data/bphotos/1565.jpg | 5 |
Office | 71680 | 109284cc5abddc83879a29785795fd75 | 45964-46103 | Farm Credit System Insurance Corporation, Statement of Financial Condition, March 31, 2006 and December 31, 2005 | 5 |
JPG | 24538540 | db32b271506b2f4974791957627c61cc | 46910-94836 | DFRWS 2006 Forensics Challenge, 8640x8640 | 5 |
JPG | 924877 | 1a5a843000ef617af93a9cad645e3cdf | 94846-95628, 95630-96653 | Saturn http://imgsrc.hubblesite.org/hu/db/2001/15/images/a/formats/full_jpg.jpg | 1 |
PhotoRec Score Legend:
0 File not found
1 First sector identified
2 + correct file type
3 + all sectors identified
4 + correct file size
5 + correct checksum
Conclusion
PhotoRec has been able to retrieve most files automatically. Results can still be improved by brute forcing the location of a JPG fragment or by adding some JPG search-only phases, but this can be time-consuming.
Thanks to
- Daniel Sedory for letting me know about this contest and for his long-time involvement in the TestDisk/PhotoRec project
- the following ESIEA students: Gregory BLANC, Fabien BOUFFARD, Hicham CHAARAOUI, Karim EL FILALI, Amine HASSANI, Igor VALLEE for their work on the OLE file format.