From the Hadoop wiki:
The sync marker permits seeking to a random point in a file and then re-synchronizing input with record boundaries. This is required to be able to efficiently split large files for MapReduce processing.
But what does it actually mark? And how is it used for "seeking"?
Here is my initial investigation.
The code piece for sync marker generation:
```java
public static class Writer implements java.io.Closeable, Syncable {
    // ...
}
```
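Digging into that initializer, the marker is 16 bytes produced by an MD5 hash over a unique ID and the creation time, so every writer (and therefore every file) gets its own marker. Below is a minimal standalone sketch of the idea; the class and method names are mine, and it is a paraphrase rather than the verbatim Hadoop code.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.UUID;

public class SyncMarkerDemo {
    /** Generate a 16-byte sync marker; one marker per writer/file. */
    static byte[] newSyncMarker() throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        // Hash a unique ID plus the current time; an MD5 digest is exactly 16 bytes.
        String seed = UUID.randomUUID() + "@" + System.currentTimeMillis();
        return md5.digest(seed.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws Exception {
        byte[] sync = newSyncMarker();
        System.out.println("sync marker length = " + sync.length);  // prints 16
    }
}
```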
And the piece for inserting it into the output:
```java
public void sync() throws IOException {
    // ...
}
```
From this code, the sync marker appears to be generated only once, during `Writer` initialization; it is written into the file header, and then written into the output again whenever the bytes written since the last sync point exceed a certain threshold.
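Putting the two pieces together: the writer remembers where the last sync point was written and, before appending the next record, emits an escape word of -1 followed by the 16 marker bytes once the output has grown by more than `SYNC_INTERVAL` bytes. Here is a simplified, self-contained sketch of that bookkeeping; the class name, record framing, and field layout are illustrative assumptions, and only the escape/interval mechanism mirrors `SequenceFile`.

```java
import java.io.DataOutputStream;
import java.io.IOException;

class SyncingWriter {
    static final int SYNC_ESCAPE = -1;                       // a record length of -1 announces "sync marker follows"
    static final int SYNC_SIZE = 4 + 16;                     // escape word + 16 marker bytes
    static final int SYNC_INTERVAL = 5 * 1024 * SYNC_SIZE;   // ~100 KB between sync points

    private final DataOutputStream out;
    private final byte[] sync;                               // the per-file 16-byte marker
    private int lastSyncPos;

    SyncingWriter(DataOutputStream out, byte[] sync) {
        this.out = out;
        this.sync = sync;
    }

    /** Write the escape word followed by the sync marker, unless we are already at a sync point. */
    void sync() throws IOException {
        if (out.size() != lastSyncPos) {
            out.writeInt(SYNC_ESCAPE);
            out.write(sync);
            lastSyncPos = out.size();
        }
    }

    /** Append an opaque record, emitting a sync marker first if one is due. */
    void append(byte[] record) throws IOException {
        if (out.size() >= lastSyncPos + SYNC_INTERVAL) {
            sync();                                          // readers can re-align roughly every SYNC_INTERVAL bytes
        }
        out.writeInt(record.length);                         // simplistic record framing: length + payload
        out.write(record);
    }
}
```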
- In `Writer` & `RecordCompressWriter`: refer to `SYNC_INTERVAL`.
  - Refer to this commit: it was changed from `100 * SYNC_SIZE` to `5 * 1024 * SYNC_SIZE`.
- In `BlockCompressWriter`: refer to `IO_SEQFILE_COMPRESS_BLOCKSIZE_KEY/DEFAULT` (default: 1,000,000).
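For reference, the constants behind these thresholds are declared roughly as follows (paraphrased from the linked `SequenceFile.java` and Hadoop's common configuration keys; the exact javadoc, visibility modifiers, and enclosing classes may differ between versions):

```java
/** Approximate declarations of the constants referenced above (not copied verbatim). */
class SequenceFileSyncConstants {
    // From org.apache.hadoop.io.SequenceFile:
    static final int SYNC_ESCAPE    = -1;                    // record-length value that announces a sync marker
    static final int SYNC_HASH_SIZE = 16;                    // bytes in the marker hash (MD5 digest size)
    static final int SYNC_SIZE      = 4 + SYNC_HASH_SIZE;    // escape word + hash = 20 bytes
    static final int SYNC_INTERVAL  = 5 * 1024 * SYNC_SIZE;  // ~100 KB between sync points (formerly 100 * SYNC_SIZE)

    // From Hadoop's common configuration keys:
    static final String IO_SEQFILE_COMPRESS_BLOCKSIZE_KEY = "io.seqfile.compress.blocksize";
    static final int IO_SEQFILE_COMPRESS_BLOCKSIZE_DEFAULT = 1000000;  // matches the 1,000,000 default above
}
```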
Then, on the reading side, the sync marker is read back in the `Reader` `init`, and it is used whenever the reader needs to find the next record boundary:
```java
/** Seek to the next sync mark past a given position. */
public synchronized void sync(long position) throws IOException {
    // ...
}
```
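Inside `sync(position)`, the reader seeks near the requested offset and then scans forward byte by byte until it recognizes the file's 16-byte marker; after that it is aligned on a record boundary again. Here is a standalone sketch of that forward scan over a plain stream (illustrative only; the real implementation additionally handles the escape word, the header region, and checksum errors):

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.util.Arrays;

public class SyncScanDemo {
    /**
     * Consume bytes until the 16-byte sync marker has just been read.
     * Afterwards the stream is positioned at a record boundary.
     * Returns true if a marker was found, false if EOF was hit first.
     */
    static boolean skipToNextSync(DataInputStream in, byte[] sync) throws IOException {
        byte[] window = new byte[sync.length];
        if (in.read(window) < window.length) {
            return false;                                 // not even one window of data left
        }
        while (!Arrays.equals(window, sync)) {
            int b = in.read();
            if (b < 0) {
                return false;                             // reached EOF without seeing the marker
            }
            // Slide the window forward by one byte and keep scanning.
            System.arraycopy(window, 1, window, 0, window.length - 1);
            window[window.length - 1] = (byte) b;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] sync = new byte[16];
        Arrays.fill(sync, (byte) 0x5A);                   // a stand-in 16-byte marker for the demo
        byte[] data = new byte[64];                       // pretend file contents (all zeros)...
        System.arraycopy(sync, 0, data, 30, sync.length); // ...with one marker buried at offset 30
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        System.out.println("marker found: " + skipToNextSync(in, sync));  // prints true
    }
}
```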
Conclusion
- The sync marker allows a seek operation to re-align to a record or block boundary (a usage sketch follows below).
- But it relies on an existing seek operation, which is provided by `Seekable.seek()`.
- Next question: "How is seek implemented on a distributed file?"
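Tying this back to the MapReduce splitting mentioned in the opening quote: a record reader handed a byte-range split seeks to the split's start, calls `sync()` to land on the next record boundary, and then reads records until it crosses a sync point past the split's end, so adjacent splits hand off cleanly. A usage sketch along those lines (the key/value types and the split bookkeeping are assumptions, modeled loosely on `SequenceFileRecordReader`):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SplitScan {
    /** Read the records that belong to one byte-range split [start, start + length) of a SequenceFile. */
    static void readSplit(Configuration conf, Path file, long start, long length) throws IOException {
        long end = start + length;
        try (SequenceFile.Reader reader =
                 new SequenceFile.Reader(conf, SequenceFile.Reader.file(file))) {
            if (start > reader.getPosition()) {
                reader.sync(start);                       // seek to the next sync mark past 'start'
            }
            LongWritable key = new LongWritable();        // assumed key type for this sketch
            Text value = new Text();                      // assumed value type for this sketch
            while (reader.next(key, value)) {
                // ... process key/value here ...
                if (reader.getPosition() >= end && reader.syncSeen()) {
                    break;                                // crossed the split end at a sync point; the next split takes over
                }
            }
        }
    }
}
```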
References
- https://hadoop.apache.org/docs/r2.8.0/api/org/apache/hadoop/io/SequenceFile.html
- https://www.reddit.com/r/hadoop/comments/4negaa/what_is_the_sequence_file_sync_marker_how_does_it/
- https://github.com/apache/hadoop/blob/release-3.2.0-RC1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/SequenceFile.java
- https://github.com/apache/hadoop/blob/release-2.8.0-RC3/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/SequenceFile.java