PhotoRec Data Carving

From CGSecurity

Data carving is the process of extracting a collection of data from a larger data set. Data carving techniques are frequently used during a digital investigation, when the unallocated file system space is analyzed to extract files. The files are "carved" from the unallocated space using file-type-specific header and footer values. File system structures are not used during the process. This is exactly how PhotoRec works.

Digital Forensics Research Workshop has issued a data carving challenge. The data set for this challenge is a 50MB raw file. It has no file system, but it contains JPEG, ZIP, HTML, Text, and Microsoft Office files and fragments. The goal is to extract as many full JPEG, ZIP, HTML, Text, and Office files as possible from it. Using this challenge as a test bed, PhotoRec has been improved to recover even more data than before.

Everyone is welcome to contribute to the project.

Data Recovery Process

PhotoRec

The first step was to run PhotoRec; version 6.5-WIP (WIP = Work In Progress) was used. PhotoRec scanned the image file for known headers and successfully recognised all JPEG, OLE/Office, HTML and ZIP headers, with no false positives.
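As an illustration of header scanning, here is a minimal Python sketch that checks the start of every 512-byte sector against a few well-known magic numbers. It is not PhotoRec's implementation: the `SIGNATURES` table and the `scan_headers` function are hypothetical names, and HTML detection (which needs a case-insensitive text match) is left out.

```python
SECTOR_SIZE = 512

# Well-known magic numbers; the table itself is illustrative, not PhotoRec's.
SIGNATURES = {
    "jpg": b"\xff\xd8\xff",                      # JPEG start-of-image marker
    "zip": b"PK\x03\x04",                        # ZIP local file header
    "ole": b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1",  # OLE compound file
}

def scan_headers(image: bytes, sector_size: int = SECTOR_SIZE):
    """Return (sector_number, file_type) for each sector starting with a known header."""
    hits = []
    for offset in range(0, len(image), sector_size):
        for file_type, magic in SIGNATURES.items():
            if image.startswith(magic, offset):
                hits.append((offset // sector_size, file_type))
                break
    return hits
```

Only sector boundaries are tested, on the assumption that a file always begins at the start of a sector.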

The JPEG footer, used to determine the file size and validity of a recovered JPEG, is checked by PhotoRec using libjpeg. ZIP footers are detected but the file integrity isn't checked. The OLE file format is very complex (its internals are similar to a file system), but PhotoRec is able to get the file size by analyzing the FAT. Text files are hard to detect because there is no header. After a UTF-8 to ASCII translation, PhotoRec calculates the index of coincidence to determine if a sector holds text or random data. There can be false positives if DOC or HTML files aren't well detected (e.g. fragmented data).
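The index of coincidence is the probability that two bytes picked at random from a block are equal. The sketch below computes it for one sector; the `looks_like_text` helper and its threshold of 0.02 are assumptions chosen for illustration, not PhotoRec's values, and the UTF-8 to ASCII translation step is not reproduced.

```python
from collections import Counter

def index_of_coincidence(data: bytes) -> float:
    """Probability that two distinct positions in data hold the same byte value."""
    n = len(data)
    if n < 2:
        return 0.0
    counts = Counter(data)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))

def looks_like_text(sector: bytes, threshold: float = 0.02) -> bool:
    """Hypothetical classifier: natural-language text scores far above random data."""
    return index_of_coincidence(sector) >= threshold
```

Uniformly random bytes give a value near 1/256 (about 0.004), while English text typically lands well above 0.05 because a few characters dominate.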

Manual recovery of remaining JPEG

PhotoRec can handle some forms of data fragmentation in JPEG files. Using the libjpeg library, it is able to validate recovered data, and in this way it recovered 9 JPEGs perfectly. Manual recovery was then used for the remaining files: with dd and PhotoRec, additional files have been recovered.

A picture of a hedgehog begins at sector 31475 and a picture from Mars begins at sector 31533. Extract from photorec.log:

31475-31532: jpg
31533-32836: jpg

[Images: Hedgehog1 small.jpg, Mars1 small.jpg]

The second picture begins while the first isn't finished - both pictures are corrupted. Reading the photorec log file, we learn that the Mars picture is corrupted after about 118784 bytes (JPG error at offset 118784). Let's try to find the exact data fragment size.

$ dd if=dfrws-2006-challenge.raw of=mars1.jpg skip=31533 count=`expr 118784 / 512`
232+0 records in
232+0 records out
$ display mars1.jpg
display: Corrupt JPEG data: premature end of data segment `mars1.jpg'.
display: Corrupt JPEG data: premature end of data segment `mars1.jpg'.

The extracted data is 232 sectors long, but garbage is visible at the end of the image, which means the real fragment is shorter.

[Image: Mars1 small.jpg]

By trial and error, it's possible to determine that the fragment is 220 sectors long.

$ dd if=dfrws-2006-challenge.raw of=mars2.jpg skip=31533 count=220
220+0 records in
220+0 records out
$ display mars2.jpg
display: Corrupt JPEG data: premature end of data segment `mars2.jpg'.
display: Corrupt JPEG data: premature end of data segment `mars2.jpg'.

There is no garbage left in the picture.

[Image: Mars2 small.jpg]

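The sector map so far:

31475-31532: jpg fragment, hedgehog
31533-31752: jpg fragment, mars
31753-32836: ?

Assuming the hedgehog picture continues right after the Mars fragment, the remaining sectors are appended to the first hedgehog fragment:
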
$ dd if=dfrws-2006-challenge.raw skip=31475 count=`expr 31532 - 31475 + 1` > hedgehog.jpg
58+0 records in
58+0 records out
$ dd if=dfrws-2006-challenge.raw skip=31753 count=`expr 32836 - 31753 + 1` >> hedgehog.jpg
1084+0 records in
1084+0 records out
$ display hedgehog.jpg

[Image: Hedgehog small.jpg]
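The two dd invocations used for the hedgehog picture can be mirrored in Python. This is only a sketch: `read_sectors` and `assemble` are hypothetical helper names, and 512-byte sectors are assumed, as with dd's default block size.

```python
SECTOR_SIZE = 512

def read_sectors(image_path, start, count, sector_size=SECTOR_SIZE):
    """Rough equivalent of `dd if=image skip=start count=count`."""
    with open(image_path, "rb") as image:
        image.seek(start * sector_size)
        return image.read(count * sector_size)

def assemble(image_path, ranges, out_path, sector_size=SECTOR_SIZE):
    """Concatenate inclusive (first, last) sector ranges into out_path.

    Returns the number of bytes written.
    """
    written = 0
    with open(out_path, "wb") as out:
        for first, last in ranges:
            chunk = read_sectors(image_path, first, last - first + 1, sector_size)
            written += out.write(chunk)
    return written

# Ranges taken from the sector map above; file names follow the dd commands.
# assemble("dfrws-2006-challenge.raw", [(31475, 31532), (31753, 32836)], "hedgehog.jpg")
```

Keeping the ranges inclusive matches the notation of photorec.log, so the `last - first + 1` arithmetic lives in one place.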

Now, the exact file size can be found using PhotoRec on the recovered picture.

$ photorec hedgehog.jpg
PhotoRec 6.4, Data Recovery Utility, June 2006
Christophe GRENIER <grenier@cgsecurity.org>
https://www.cgsecurity.org
Please wait...
Disk hedgehog.jpg - 584 KB / 571 KiB - CHS 1 255 63 (RO), sector size=512

PhotoRec exited normally.
$ ls -l recup_dir.1/f0.jpg
-rw-rw-r--  1 kmaster kmaster 98354 Jul 10 11:30 recup_dir.1/f0.jpg
$ md5sum recup_dir.1/f0.jpg
db89684c177168036e274140ecf766a1  recup_dir.1/f0.jpg

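The picture size is 98354 bytes (193 sectors). The first hedgehog fragment holds 58 sectors, so the remaining 135 sectors occupy 31753-31887 and the Mars picture resumes at the next sector. We can now recover the Mars picture.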

$ expr 31753 + 193 - 58
31888
$ dd if=dfrws-2006-challenge.raw skip=31533 count=220 > mars3.jpg
220+0 records in
220+0 records out
$ dd if=dfrws-2006-challenge.raw skip=31888 count=`expr 32836 - 31888 + 1` >> mars3.jpg

[Image: Mars3 small.jpg]
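The sector arithmetic behind the `expr 31753 + 193 - 58` step can be written out as follows. The function names are hypothetical; the numbers come from the text.

```python
SECTOR_SIZE = 512

def sectors_needed(size_bytes, sector_size=SECTOR_SIZE):
    """Number of sectors occupied by a file, rounded up."""
    return -(-size_bytes // sector_size)

def second_fragment(file_size, first_range, second_start):
    """Inclusive sector range of the second fragment of a two-fragment file."""
    first, last = first_range
    remaining = sectors_needed(file_size) - (last - first + 1)
    return second_start, second_start + remaining - 1

hedgehog = second_fragment(98354, (31475, 31532), 31753)
print(hedgehog)         # (31753, 31887)
print(hedgehog[1] + 1)  # 31888, where the Mars picture resumes
```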

As seen before, it's possible to get the exact file size:

$ photorec mars3.jpg
PhotoRec 6.4, Data Recovery Utility, June 2006
Christophe GRENIER <grenier@cgsecurity.org>
https://www.cgsecurity.org
Please wait...
Disk mars3.jpg - 598 KB / 584 KiB - CHS 1 255 63 (RO), sector size=512

PhotoRec exited normally.
$ ls -l recup_dir.2/
total 192
-rw-rw-r--  1 kmaster kmaster 188693 Jul 10 11:47 f0.jpg
$ md5sum recup_dir.2/f0.jpg
0915313e99af0f6bf13bc06bcd003113  recup_dir.2/f0.jpg
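For comparing recovered files against a reference list, the `ls -l` and `md5sum` steps can be combined in a small helper. This is a sketch only; `fingerprint` is a hypothetical name and 512-byte sectors are assumed.

```python
import hashlib

def fingerprint(data: bytes, sector_size: int = 512):
    """Return (size in bytes, size in sectors rounded up, MD5 hex digest)."""
    sectors = -(-len(data) // sector_size)
    return len(data), sectors, hashlib.md5(data).hexdigest()

# Example, assuming the recovered file from the session above exists:
# with open("recup_dir.2/f0.jpg", "rb") as f:
#     print(fingerprint(f.read()))
```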

Manual recovery of zip files

Three zip files are recovered by PhotoRec but one of them is corrupted. A small Perl script was used to fix the zip file beginning at sector 45015 found by PhotoRec. Using unzip, this little Perl script locates and removes the extra sectors present in the file.
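The Perl script itself is not reproduced here. As a rough sketch of the same idea, the Python code below drops each candidate run of sectors in turn and keeps the first result whose members all pass their CRC check. It uses Python's `zipfile` module in place of unzip; the function names and the `max_extra` limit are assumptions.

```python
import io
import zipfile

SECTOR_SIZE = 512

def zip_is_intact(data: bytes) -> bool:
    """True if the archive opens and every member passes its CRC check."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            return archive.testzip() is None
    except Exception:  # damaged archives may raise BadZipFile, zlib.error, ...
        return False

def remove_extra_sectors(data: bytes, max_extra: int = 8,
                         sector_size: int = SECTOR_SIZE):
    """Brute-force search for a run of foreign sectors inside a carved ZIP.

    Returns (first_sector, sector_count, repaired_bytes), or None if no
    removal of 1..max_extra consecutive sectors yields a valid archive.
    """
    total = -(-len(data) // sector_size)
    for count in range(1, max_extra + 1):
        for first in range(1, total - count + 1):
            start = first * sector_size
            candidate = data[:start] + data[start + count * sector_size:]
            if zip_is_intact(candidate):
                return first, count, candidate
    return None
```

The search is quadratic in the worst case, which is acceptable for a file of a few hundred sectors but would be slow on very large archives.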

Manual recovery of XLS/Ole file

Office documents, including Excel workbooks, use the OLE file format. A document has been identified at sector 2051 but it hasn't been successfully recovered by PhotoRec (the file may be fragmented). An OLE file consists of a header structure and a list of all sectors following the header. In our case,

  • The size of the sectors is 512 bytes.
  • The SID (Sector Identifier) of the first sector of the directory stream is 1688.
  • The master sector allocation table lists 14 allocation table sectors: SIDs 1673-1685 and 1689.
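The three values listed above can be read directly from the 512-byte OLE header. The sketch below uses the field offsets of the compound file header (sector shift at byte 30, allocation table sector count at 44, first directory SID at 48, first 109 master table entries at 76); the function name and the returned dictionary keys are hypothetical.

```python
import struct

OLE_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"
FREE_SID = 0xFFFFFFFF  # marks an unused slot in the master allocation table

def parse_ole_header(header: bytes) -> dict:
    """Extract the sector size, directory SID and allocation table SIDs."""
    if len(header) < 512 or header[:8] != OLE_MAGIC:
        raise ValueError("not an OLE compound file header")
    (sector_shift,) = struct.unpack_from("<H", header, 30)
    sat_count, first_dir_sid = struct.unpack_from("<II", header, 44)
    # The header stores the first 109 entries of the master allocation table.
    msat = [sid for sid in struct.unpack_from("<109I", header, 76)
            if sid != FREE_SID]
    return {
        "sector_size": 1 << sector_shift,
        "sat_sector_count": sat_count,
        "first_directory_sid": first_dir_sid,
        "sat_sids": msat,
    }
```

Files with more than 109 allocation table sectors continue the master table in extra sectors, which this sketch does not follow.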

The directory stream (SID 1688, found at disk sector 3761) lists the following components:

Workbook SID 0 size 848333
SummaryInformation SID 1657 size 4096
DocumentSummaryInformation SID 1690 size 4096

| Sectors | Object | SID |
|---|---|---|
| 2051 | Header | N/A |
| 2052-?, ?-3729 | Workbook | 0-1656 |
| x to x+20 | 21 extra sectors, not XLS | N/A |
| 3730-3737 | SummaryInformation | 1657-1664 |
| 3746-3758, 3762 | Allocation Table | 1673-1685, 1689 |
| 3761 | RootDirectory | 1688 |
| 3763-3770 | DocumentSummaryInformation | 1690-1697 |
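The figure of 21 extra sectors follows from comparing where each SID would sit in a contiguous file with where it was actually found. In an OLE file, SID 0 is the sector right after the header, so a contiguous copy would place SID n at disk sector 2051 + 1 + n. A short sketch, with hypothetical function names:

```python
HEADER_SECTOR = 2051  # disk sector of the OLE header, from the text

def expected_disk_sector(sid, header_sector=HEADER_SECTOR):
    """Disk sector of a SID if the file were stored contiguously."""
    return header_sector + 1 + sid

def shift(sid, observed_sector, header_sector=HEADER_SECTOR):
    """Number of foreign sectors sitting before this SID on disk."""
    return observed_sector - expected_disk_sector(sid, header_sector)

observations = [
    ("SummaryInformation", 1657, 3730),
    ("RootDirectory", 1688, 3761),
    ("DocumentSummaryInformation", 1690, 3763),
]
for name, sid, observed in observations:
    print(name, shift(sid, observed))  # each line ends with 21
```

Since every object after the Workbook stream is shifted by the same amount, the extra sectors must lie somewhere inside the Workbook range.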

Unfortunately, I have failed to locate the 21 extra sectors. Nevertheless, the latest version of OpenOffice has been able to open the corrupted file and display most of the data. A new version of the document can be found on the FCC web site.

Disk Layout

| Sectors | File type | Note |
|---|---|---|
| 0-8 | HTML (fragment) | Alice in Wonderland by Lewis Carroll |
| 9-44 | HTML | Alice in Wonderland by Lewis Carroll |
| 2051 | Office (fragment) | Excel, http://www.fcc.gov/Forms/Form477/477.xls |
| 3868-4428 | JPG | 640x481 Mars |
| 4436-4455 | HTML (end is missing) | A STUDY IN SCARLET by Sir Arthur Conan Doyle |
| 4456-4501 | HTML | Stave 1: Marley's Ghost by Charles Dickens |
| 4502-4556 | HTML (beginning is missing) | Stave 1: Marley's Ghost by Charles Dickens |
| 7964-8284 | Office | Upcoming Research Symposium 1/3 |
| 8285-9473 | JPG | www.dfrws.org/2004/photos/day2/rodeo1-3-dfrws2004.jpg |
| 9474 | Office | Upcoming Research Symposium 2/3 |
| 10031 | Office | Upcoming Research Symposium 3/3 |
| 11619-11822 | JPG | yeast 1/2 |
| 11823-11848 | Text | Moby Dick, Chapter i - LOOMINGS (page 1-6) |
| 11849-12017 | JPG | yeast 2/2 |
| 12222-26116 | JPG | DFRWS 2006 Forensics Challenge, 11598x11598 |
| 27496-27606 | HTML | The Comedy of Errors by Shakespeare, Act I, Scene I (1/2) |
| 27607-27977 | JPG | The porcupine |
| 27978-28196 | HTML | The Comedy of Errors by Shakespeare (2/2) |
| 28244-28245 | HTML | Moby Dick - chapter 134 (1/2) |
| 28246-28306 | Text (fragment) | De la division du travail social, Emile Durkheim |
| 28307-28344 | HTML | Moby Dick page - chapter 135 (2/2) |
| 28439-28726 | ZIP | Zip Ok |
| 28729-29528 | ZIP | ZIP 1/2 |
| 29529-29895 | HTML | The Tempest, Shakespeare |
| 29896-31368 | ZIP | ZIP 2/2 |
| 31475-31532 | JPG | A hedgehog (1/2) |
| 31533-31752 | JPG | Mars (1/2) |
| 31753-31887 | JPG | A hedgehog (2/2) |
| 31888-32036 | JPG | Mars (2/2) |
| 32837-33397 | Office | www.tsa.gov/public/interweb/assetlibrary/Permitted_Prohibited_Facts.doc |
| 34288-34306 | Office | "Reports on Computer Systems Technology", http://csrc.nist.gov/publications/nistpubs/800-26/sp800-26.doc 1/2 |
| 34307-34412 | Text | The Adventure of the Copper Beeches |
| 34413-36236 | Office | http://csrc.nist.gov/publications/nistpubs/800-26/sp800-26.doc 2/2 |
| 36292-36640 | JPG | ? |
| 36998-37649 | Office | PREVENTING CRIME: WHAT WORKS, WHAT DOESN'T, WHAT'S PROMISING, www.ncjrs.org/docfiles/wholedoc.doc 1/3 |
| 37727-39427 | Office | www.ncjrs.org/docfiles/wholedoc.doc 2/3 |
| 39477-40380 | Office | www.ncjrs.org/docfiles/wholedoc.doc 3/3 |
| 40638-41219 | JPG | www.dfrws.org/2004/photos/day2/rodeo1-breaf-dfrws2004.jpg 1/2 |
| 41239-41609 | JPG | www.dfrws.org/2004/photos/day2/rodeo1-breaf-dfrws2004.jpg 2/2 |
| 41611-43433 | JPG | http://imgsrc.hubblesite.org/hu/db/2006/10/images/a/formats/1280_wallpaper.jpg (1/2) |
| 43434-44028 | JPG | www.dfrws.org/2004/photos/day2/rodeo1-dfrws2004.jpg |
| 44029-44200 | JPG | http://imgsrc.hubblesite.org/hu/db/2006/10/images/a/formats/1280_wallpaper.jpg (2/2) |
| 45015-45386 | ZIP | Zip 1/2 |
| 45390-45545 | ZIP | Zip 2/2 |
| 45566-45963 | JPG | U. S. Geological Survey Open-File Report 01-154, Slope off Florida Keys https://pubs.usgs.gov/of/2001/of01-154/data/bphotos/1565.jpg 1/2 |
| 45964-46103 | Office | Farm Credit System Insurance Corporation, Statement of Financial Condition March 31, 2006 and December 31, 2005, www.fcsic.gov/documents/3-31-2006%20Financial%20Statement.doc |
| 46104-46826 | JPG | https://pubs.usgs.gov/of/2001/of01-154/data/bphotos/1565.jpg 2/2 |
| 46910-94836 | JPG | DFRWS 2006 Forensics Challenge, 8640x8640 |
| 94846-95628 | JPG | Saturn http://imgsrc.hubblesite.org/hu/db/2001/15/images/a/formats/full_jpg.jpg (1/2) |
| 95630-96653 | JPG | Saturn http://imgsrc.hubblesite.org/hu/db/2001/15/images/a/formats/full_jpg.jpg (2/2) |

Files

| File type | File size (in bytes) | MD5 hash | Sectors | Note | PhotoRec Score |
|---|---|---|---|---|---|
| HTML (fragment) | 4608 | ec89111e45da8265b641655d0f68725e | 0-8 | Alice in Wonderland by Lewis Carroll | 5 |
| HTML | 18147 | eec87931b03e5a4a4ef8fd51109a1227 | 9-44 | Alice in Wonderland by Lewis Carroll | 5 |
| Office | 869888 | ? | ~ 2051-3770 (21 extra sectors) | http://www.fcc.gov/Forms/Form477/477.xls | 1 |
| JPG | 287186 | daf4205574abd6919b10ca8be92d17a3 | 3868-4428 | 640x481 Mars | 5 |
| HTML (end is missing) | 10240 | 799ad2d2f2f1f17657338d98c97559c4 | 4436-4455 | A STUDY IN SCARLET by Sir Arthur Conan Doyle | 5 |
| HTML | 23544 | f4481ed348d3d59c5dad80afeb0341f9 | 4456-4501 | Stave 1: Marley's Ghost by Charles Dickens | 5 |
| HTML (beginning is missing) | 27875 | baf8b811ee9502408f9f0e73efa77cf0 | 4502-4556 | Stave 1: Marley's Ghost by Charles Dickens | 5 |
| Office | 450048 | 8d2a9a284e078805ada47db191f35244 | 7964-8284, 9474-10031 | Upcoming Research Symposium | 5 |
| JPG | 608703 | 4efc6c572683878efd8f3404ddaded7b | 8285-9473 | www.dfrws.org/2004/photos/day2/rodeo1-3-dfrws2004.jpg | 5 |
| JPG | 190720 | 7b07320709e0caa947663f5df3a0a390 | 11619-11822, 11849-12017 | yeast | 5 |
| Text | 12826 | f800a46e18fafd309825c5ee84a654a2 | 11823-11848 | Moby Dick, Chapter i - LOOMINGS (page 1-6) | 3 |
| JPG | 7113968 | b070beae1606f67a342bc5f78c29c743 | 12222-26116 | DFRWS 2006 Forensics Challenge, 11598x11598 | 5 |
| HTML | 168525 | 1959aa0391664b60fd0f2e64ed7a22f4 | 27496-27606, 27978-28196 | The Comedy of Errors by Shakespeare, Act I, Scene I | 2 |
| JPG | 189534 | fe7e7ac67709f2d9c2483aa98c681b99 | 27607-27977 | The porcupine | 5 |
| HTML | 20019 | 045798407b927321326a547704e67831 | 28244-28245, 28307-28344 | Moby Dick - chapter 134 and 135 | 2 |
| Text (fragment) | 30816 | 616a6bbe915c3dbf51014fd76f55b0e3 | 28246-28306 | De la division du travail social, Emile Durkheim | 0 |
| ZIP | 147150 | ebabde39ba44d38888dd82606980498a | 28439-28726 | Zip Ok | 5 |
| ZIP | 1163745 | 9a4c2d3a9bd203eb39c9f954a3c997e4 | 28729-29528, 29896-31368 | ZIP | 5 |
| HTML | 187793 | 158496c522d97b7389c9907cae777ac1 | 29529-29895 | The Tempest, Shakespeare | 5 |
| JPG | 98354 | db89684c177168036e274140ecf766a1 | 31475-31532, 31753-31887 | A hedgehog | 2 |
| JPG | 188693 | 0915313e99af0f6bf13bc06bcd003113 | 31533-31752, 31888-32036 | Mars | 2 |
| Office | 287232 | 0e52e75029e99cd2e9dcd0af271cf4a2 | 32837-33397 | www.tsa.gov/public/interweb/assetlibrary/Permitted_Prohibited_Facts.doc | 5 |
| Office | 943616 | d7ff92b8cc1c89c46a78288b9c673152 | 34288-34306, 34413-36236 | http://csrc.nist.gov/publications/nistpubs/800-26/sp800-26.doc | 2 |
| Text | 53870 | 5a12ef9dba88a186ef18a5d349b28e37 | 34307-34412 | The Adventure of the Copper Beeches | 3 |
| JPG | 178659 | 2fae8770cc013d22e9ea1c070f2f509b | 36292-36640 | ? | 5 |
| Office | 1667584 | 4a22f04b097920d11fff4e192e0667a4 | 36998-37649, 37727-39427, 39477-40380 | PREVENTING CRIME: WHAT WORKS, WHAT DOESN'T, WHAT'S PROMISING, www.ncjrs.org/docfiles/wholedoc.doc | 2 |
| JPG | 487473 | f8c51e0688796b5d616f0e5d4a94d104 | 40638-41219, 41239-41609 | www.dfrws.org/2004/photos/day2/rodeo1-breaf-dfrws2004.jpg | 2 |
| JPG | 1021085 | 7cce072e518fd72484c97adb1b4be08e | 41611-43433, 44029-44200 | http://imgsrc.hubblesite.org/hu/db/2006/10/images/a/formats/1280_wallpaper.jpg | 5 |
| JPG | 304413 | c0da37b3f1a07af790e6e9171cedc4d2 | 43434-44028 | www.dfrws.org/2004/photos/day2/rodeo1-dfrws2004.jpg | 5 |
| ZIP | 270181 | f940fcc37c82e8ff1431e5c3c061611e | 45015-45386, 45390-45545 | Zip | 2 |
| JPG | 573499 | 2320fe9c41eaddb864a56c2ddc4dd186 | 45566-45963, 46104-46826 | U. S. Geological Survey Open-File Report 01-154, Slope off Florida Keys http://pubs.usgs.gov/of/2001/of01-154/data/bphotos/1565.jpg | 5 |
| Office | 71680 | 109284cc5abddc83879a29785795fd75 | 45964-46103 | Farm Credit System Insurance Corporation, Statement of Financial Condition March 31, 2006 and December 31, 2005, www.fcsic.gov/documents/3-31-2006%20Financial%20Statement.doc | 5 |
| JPG | 24538540 | db32b271506b2f4974791957627c61cc | 46910-94836 | DFRWS 2006 Forensics Challenge, 8640x8640 | 5 |
| JPG | 924877 | 1a5a843000ef617af93a9cad645e3cdf | 94846-95628, 95630-96653 | Saturn http://imgsrc.hubblesite.org/hu/db/2001/15/images/a/formats/full_jpg.jpg | 1 |

PhotoRec Score Legend:

0 File not found
1 First sector identified
2 + correct file type
3 + all sectors identified
4 + correct file size
5 + correct checksum

Conclusion

PhotoRec has been able to retrieve most files automatically. Results can still be improved by brute-forcing a JPG fragment location or adding some JPG search-only phases, but this can be time-consuming.

Thanks to

  • Daniel Sedory for letting me know about this contest and for his long-time involvement in the TestDisk/PhotoRec project
  • the following ESIEA students: Gregory BLANC, Fabien BOUFFARD, Hicham CHAARAOUI, Karim EL FILALI, Amine HASSANI, Igor VALLEE for their work on the OLE file format.

Christophe GRENIER