Tuesday, 2 July 2013

Hadoop Interview Questions and Answers for Freshers and Experienced

What is a SequenceFile in Hadoop?

A. A SequenceFile contains a binary encoding of an arbitrary number of homogeneous writable objects.
B. A SequenceFile contains a binary encoding of an arbitrary number of heterogeneous writable objects.
C. A SequenceFile contains a binary encoding of an arbitrary number of WritableComparable objects, in sorted order.
D. A SequenceFile contains a binary encoding of an arbitrary number of key-value pairs. Each key must be the same type. Each value must be the same type.

Answer: D

Is there a map input format in Hadoop?

A.  Yes, but only in Hadoop 0.22+.
B.  Yes, there is a special format for map files.
C.  No, but sequence file input format can read map files.
D.  Both B and C are correct answers.

Answer: C

What happens if mapper output does not match reducer input in Hadoop?

A.  Hadoop API will convert the data to the type that is needed by the reducer.
B.  Data input/output inconsistency cannot occur. A preliminary validation check is executed prior to the full execution of the job to ensure there is consistency.
C.  The java compiler will report an error during compilation but the job will complete with exceptions.
D.  A real-time exception will be thrown and map-reduce job will fail.

Answer: D

Can you provide multiple input paths to a map-reduce job in Hadoop?

A.  Yes, but only in Hadoop 0.22+.
B.  No, Hadoop always operates on one input directory.
C.  Yes, developers can add any number of input paths.
D.  Yes, but the limit is currently capped at 10 input paths.

Answer: C

Can a custom data type for Map-Reduce processing be implemented in Hadoop?

A.  No, Hadoop does not provide techniques for custom datatypes.
B.  Yes, but only for mappers.
C.  Yes, custom data types can be implemented as long as they implement the Writable interface.
D.  Yes, but only for reducers.

Answer: C
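
As a rough illustration of what implementing the Writable interface involves, the sketch below serializes a custom type field by field and reads it back. SimpleWritable is a local stand-in that mirrors the two methods of Hadoop's Writable (write and readFields) so the example compiles without Hadoop on the classpath. PointWritable and roundTrip are hypothetical names invented for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Local stand-in (assumption) for org.apache.hadoop.io.Writable,
// declared here so the sketch runs without Hadoop.
interface SimpleWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical custom type: a pair of integers.
class PointWritable implements SimpleWritable {
    int x;
    int y;

    // A no-argument constructor lets a framework create an empty
    // instance and then fill it through readFields.
    PointWritable() {}

    PointWritable(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // Fields are written in a fixed order.
        out.writeInt(x);
        out.writeInt(y);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Fields are read back in the same order they were written.
        x = in.readInt();
        y = in.readInt();
    }

    // Serializes p to a byte array and deserializes it into a new instance.
    static PointWritable roundTrip(PointWritable p) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            p.write(new DataOutputStream(bytes));
            PointWritable copy = new PointWritable();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A type used as a key would also need to be comparable so that keys can be sorted. This sketch covers only the serialization half.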


The Hadoop API uses basic types such as LongWritable, Text, and IntWritable. They have almost the same features as the default Java classes. What are these Writable data types optimized for?

A.  Writable data types are specifically optimized for network transmissions
B.  Writable data types are specifically optimized for file system storage
C.  Writable data types are specifically optimized for map-reduce processing
D.  Writable data types are specifically optimized for data retrieval

Answer: A

What is Writable in Hadoop?

A.  Writable is a Java interface that needs to be implemented for streaming data to remote servers.
B.  Writable is a Java interface that needs to be implemented for HDFS writes.
C.  Writable is a Java interface that needs to be implemented for MapReduce processing.
D.  None of these answers are correct.

Answer: C

What is the best performance one can expect from a Hadoop cluster?

A.  The best performance expectation one can have is measured in seconds. This is because Hadoop can only be used for batch processing
B.  The best performance expectation one can have is measured in milliseconds. This is because Hadoop executes in parallel across so many machines
C.  The best performance expectation one can have is measured in minutes. This is because Hadoop can only be used for batch processing
D.  It depends on the design of the map-reduce program, the number of machines in the cluster, and the amount of data being retrieved

Answer: A

What is distributed cache in Hadoop?

A.  The distributed cache is a special component on the namenode that caches frequently used data for faster client response. It is used during the reduce step.
B.  The distributed cache is a special component on the datanode that caches frequently used data for faster client response. It is used during the map step.
C.  The distributed cache is a component that caches Java objects.
D.  The distributed cache is a component that allows developers to deploy jars for Map-Reduce processing.

Answer: D

Can you run Map-Reduce jobs directly on Avro data in Hadoop?

A.  Yes, Avro was specifically designed for data processing via Map-Reduce
B.  Yes, but additional extensive coding is required
C.  No, Avro was specifically designed for data storage only
D.  Avro specifies metadata that allows easier data access. This data cannot be used as part of map-reduce execution, rather input specification only.

Answer: A

What is Avro in Hadoop?

A.  Avro is a Java serialization library
B.  Avro is a Java compression library
C.  Avro is a Java library that creates splittable files
D.  None of these answers are correct

Answer: A

Will settings made through the Java API override values in configuration files in Hadoop?

A.  No. The configuration settings in the configuration file take precedence
B.  Yes. The configuration settings made through the Java API take precedence
C.  It depends when the developer reads the configuration file. If it is read first then no.
D.  Only global configuration settings are captured in configuration files on namenode. There are only a very few job parameters that can be set using Java API.

Answer: B
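
The precedence described in answer B can be pictured with a plain java.util.Properties object layered over defaults. This is only an analogy and does not use Hadoop's own configuration classes. ConfigPrecedence and effectiveValue are hypothetical names, and the fileValue argument stands in for whatever a configuration file would supply.

```java
import java.util.Properties;

class ConfigPrecedence {
    // Returns the value a job would see for key, assuming values set in code
    // are layered on top of values loaded from a configuration file.
    static String effectiveValue(String key, String fileValue, String apiValue) {
        // Stand-in for settings read from a configuration file.
        Properties fromFile = new Properties();
        fromFile.setProperty(key, fileValue);

        // Job-level settings fall back to the file values as defaults.
        Properties job = new Properties(fromFile);
        if (apiValue != null) {
            // A value set in code shadows the file value.
            job.setProperty(key, apiValue);
        }
        return job.getProperty(key);
    }
}
```

When apiValue is null the file value is returned unchanged, which matches the idea that the file supplies defaults and code supplies per-job overrides.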

Which is faster: Map-side join or Reduce-side join? Why?

A.  Both techniques have about the same performance expectations.
B.  Reduce-side join because the join operation is done on HDFS.
C.  Map-side join is faster because the join operation is done in memory.
D.  Reduce-side join because it is executed on the namenode, which will have a faster CPU and more memory.

Answer: C

What are the common problems with map-side join in Hadoop?

A.  The most common problem with map-side joins is introducing a high level of code complexity. This complexity has several downsides: increased risk of bugs and performance degradation. Developers are cautioned to rarely use map-side joins.
B.  The most common problem with map-side joins is a lack of available map slots, since map-side joins require a lot of mappers.
C.  The most common problems with map-side joins are out of memory exceptions on slave nodes.
D.  The most common problem with map-side join is not clearly specifying primary index in the join. This can lead to very slow performance on large datasets.

Answer: C

How can you override the default input format in Hadoop?

A.  To override the default input format, the Hadoop administrator has to change the default settings in the config file.
B.  To override the default input format, a developer has to set the new input format on the job config before submitting the job to a cluster.
C.  The default input format is controlled by each individual mapper and each line needs to be parsed individually.
D.  None of these answers are correct.

Answer: B

What is the default input format in Hadoop?

A.  The default input format is XML. Developers can specify other input formats as appropriate if XML is not the correct input.
B.  There is no default input format. The input format always should be specified.
C.  The default input format is a sequence file format. The data needs to be preprocessed before using the default input format.
D.  The default input format is TextInputFormat with byte offset as a key and entire line as a value.

Answer: D
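
To make the "byte offset as key, entire line as value" idea concrete, here is a minimal sketch in plain Java that splits a byte array into such records. It does not use Hadoop's TextInputFormat. LineRecords and Record are hypothetical names, and the sketch assumes UTF-8 text with "\n" line terminators only.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class LineRecords {
    // One record: the byte offset where the line starts (key)
    // and the text of the line without its terminator (value).
    static final class Record {
        final long offset;
        final String line;

        Record(long offset, String line) {
            this.offset = offset;
            this.line = line;
        }
    }

    // Splits data on '\n' and records the starting byte offset of each line.
    static List<Record> read(byte[] data) {
        List<Record> records = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= data.length; i++) {
            boolean atEnd = (i == data.length);
            // Emit a record at each newline, and at end of input
            // if the last line has no trailing newline.
            boolean emit = atEnd ? start < data.length : data[i] == '\n';
            if (emit) {
                String line = new String(data, start, i - start, StandardCharsets.UTF_8);
                records.add(new Record(start, line));
                start = i + 1;
            }
        }
        return records;
    }
}
```

The key is a byte offset rather than a line number, so a line that follows multi-byte characters starts at a larger offset than its character position would suggest.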

Why would a developer create a map-reduce job without the reduce step in Hadoop?

A.  Developers should design Map-Reduce jobs without reducers only if no reduce slots are available on the cluster.
B.  Developers should never design Map-Reduce jobs without reducers. An error will occur upon compile.
C.  There is a CPU intensive step that occurs between the map and reduce steps. Disabling the reduce step speeds up data processing.
D.  It is not possible to create a map-reduce job without at least one reduce step. A developer may decide to limit to one reducer for debugging purposes.

Answer: C

How can you disable the reduce step in Hadoop?

A.  The Hadoop administrator has to set the number of reducer slots to zero on all slave nodes. This will disable the reduce step.
B.  It is impossible to disable the reduce step since it is a critical part of the Map-Reduce abstraction.
C.  A developer can always set the number of reducers to zero. That will completely disable the reduce step.
D.  While you cannot completely disable reducers you can set output to one. There needs to be at least one reduce step in Map-Reduce abstraction.

Answer: C

What is Pig in Hadoop?

A.  Pig is a subset of the Hadoop API for data processing
B.  Pig is a part of the Apache Hadoop project that provides a C-like scripting language interface for data processing
C.  Pig is a part of the Apache Hadoop project. It is a "PL-SQL" interface for data processing in a Hadoop cluster
D.  PIG is the third most popular form of meat in the US behind poultry and beef.

Answer: B

What is reduce-side join in Hadoop?

A.  Reduce-side join is a technique to eliminate data from initial data set at reduce step
B.  Reduce-side join is a technique for merging data from different sources based on a specific key. There are no memory restrictions
C.  Reduce-side join is a set of API to merge data from different sources.
D.  None of these answers are correct

Answer: B

What is map-side join in Hadoop?

A.  Map-side join is done in the map phase and done in memory
B.  Map-side join is a technique in which data is eliminated at the map step
C.  Map-side join is a form of map-reduce API which joins data from different locations
D.  None of these answers are correct

Answer: A
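
The in-memory nature of a map-side join can be sketched in plain Java: the smaller dataset is held in a HashMap and each row of the larger dataset is joined as it streams past. This is an illustration of the idea only, not Hadoop code. MapSideJoinSketch, the "id,amount" row layout, and the inner-join behaviour are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MapSideJoinSketch {
    // small: id -> name, assumed to fit entirely in memory.
    // largeRows: rows of the form "id,amount", processed one at a time.
    // Returns joined rows "id,name,amount" for ids present in both inputs.
    static List<String> join(Map<String, String> small, List<String> largeRows) {
        List<String> out = new ArrayList<>();
        for (String row : largeRows) {
            String[] parts = row.split(",", 2);
            if (parts.length < 2) {
                continue; // skip malformed rows
            }
            // Constant-time lookup against the in-memory table.
            String name = small.get(parts[0]);
            if (name != null) {
                out.add(parts[0] + "," + name + "," + parts[1]);
            }
        }
        return out;
    }
}
```

Because the whole lookup table lives on the heap of each task, a table that is too large is the natural source of the out-of-memory problems mentioned for map-side joins.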

How can you use binary data in MapReduce in Hadoop?

A.  Binary data can be used directly by a map-reduce job. Often binary data is added to a sequence file.
B.  Binary data cannot be used by the Hadoop framework. Binary data should be converted to a Hadoop-compatible format prior to loading.
C.  Binary data can be used in map-reduce only with very limited functionality. It cannot be used as a key, for example.
D.  Hadoop can freely use binary files with map-reduce jobs so long as the files have headers

Answer: A

What are map files and why are they important in Hadoop?

A.  Map files are stored on the namenode and capture the metadata for all blocks on a particular rack. This is how Hadoop is "rack aware"
B.  Map files are the files that show how the data is distributed in the Hadoop cluster.
C.  Map files are generated by Map-Reduce after the reduce step. They show the task distribution during job execution
D.  Map files are sorted sequence files that also have an index. The index allows fast data look up.

Answer: D
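
The way an index speeds up lookups in sorted data can be sketched in plain Java. The class below keeps sorted keys in an array, records the position of every Nth key in a sparse index, binary-searches that index, and then scans forward a short distance. It illustrates the idea only and does not reproduce Hadoop's MapFile format. IndexedSortedFile and the interval parameter are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class IndexedSortedFile {
    private final String[] keys;   // sorted ascending; stands in for the data file
    private final String[] values; // values[i] belongs to keys[i]
    // Positions of every interval-th key; stands in for the index file.
    private final List<Integer> indexPositions = new ArrayList<>();

    IndexedSortedFile(String[] sortedKeys, String[] values, int interval) {
        if (interval < 1 || sortedKeys.length != values.length) {
            throw new IllegalArgumentException("bad interval or array lengths");
        }
        this.keys = sortedKeys;
        this.values = values;
        for (int i = 0; i < sortedKeys.length; i += interval) {
            indexPositions.add(i);
        }
    }

    // Returns the value for key, or null if the key is absent.
    String get(String key) {
        // Binary search the sparse index for the last indexed key <= key.
        int lo = 0;
        int hi = indexPositions.size() - 1;
        int start = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int pos = indexPositions.get(mid);
            if (keys[pos].compareTo(key) <= 0) {
                start = pos;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        if (start < 0) {
            return null; // key sorts before the first entry
        }
        // Scan forward from the indexed position; sorted order lets us stop early.
        for (int i = start; i < keys.length; i++) {
            int c = keys[i].compareTo(key);
            if (c == 0) {
                return values[i];
            }
            if (c > 0) {
                break;
            }
        }
        return null;
    }
}
```

A larger interval makes the index smaller at the cost of a longer forward scan per lookup.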

What are sequence files and why are they important in Hadoop?

A.  Sequence files are binary format files that are compressed and are splittable. They are often used in high-performance map-reduce jobs
B.  Sequence files are a type of file in the Hadoop framework that allows data to be sorted
C.  Sequence files are intermediate files that are created by Hadoop after the map step
D.  Both B and C are correct

Answer: A

How many methods does the Writable interface define in Hadoop?

A. Two
B. Four
C. Three
D. None of the above

Answer: A
