Hadoop + Solr AWS Deployment

Ariya Bala Sadaiappan
Hadoop+Solr+AWS Integration
2 Apr 2014
Installation - Cloudera Manager / Hadoop Cluster

1. Launch EC2 Instances: (Use Spot Instances)

Reference: http://blog.cloudera.com/blog/2013/03/how-to-create-a-cdh-cluster-on-amazon-viacloudera-manager/
• Log in to the AS AWS Dev account
  • https://console.aws.amazon.com/iam/home?#users
• Launch ‘n’ EC2 Spot Instances for installing Cloudera Manager and the Hadoop cluster
  • Navigate to EC2 > Spot Requests > Request Spot Instances > My AMIs
  • Select AMI Id: ami-a3d7ccca
  • Opt for the preferred instance type
• Configure Instance Details
  • Number of Instances - N {user specific}
  • Max. Price - 0.245
  • Availability Zone - select the one with the lowest current price in the table
• Add Storage
  • Ensure the root volume size is 50 GB
• Tag Spot Request
  • Tag with a custom key-value pair: POC - {Date}, e.g. Apr14
  • Ensure the tag key is “POC”
• Configure Security Group
  • Select Existing Security Group
  • Security Group Id - sg-f1617e9a
  • Name - jclouds#poc-ariya
• Review and Launch
  • Choose the existing key pair - ads
• Request Spot Instances and View Instance Requests
  • Copy the comma-separated list of spot request IDs into a file named “spotrequests.txt”, replacing each “,” with a newline so that the file holds one request ID per line
  • Wait until the status changes from “pending-evaluation” to “active”

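The conversion of the copied, comma-separated request IDs into the one-per-line layout of spotrequests.txt can be scripted. The following is a minimal sketch, not part of the archived scripts: the function name split_request_ids and the request IDs are made up for illustration, and the file is written to a temporary directory so the example can be tried safely.

```shell
# Hypothetical helper: turn a comma-separated list of Spot request IDs
# into one ID per line, trimming stray whitespace and empty entries.
split_request_ids() {
  tr ',' '\n' | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d'
}

# Made-up request IDs, written to a scratch directory
workdir=$(mktemp -d)
printf 'sir-aaaa1111, sir-bbbb2222,sir-cccc3333\n' \
  | split_request_ids > "$workdir/spotrequests.txt"
cat "$workdir/spotrequests.txt"
```

In practice the printf line would be replaced by the list copied from the console.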
2. Prepare Export Variables:

This section explains the scripts used to prepare the list of export variables.

Note: The scripts and files referred to in this document are available on GDrive under Hadoop_Solr POC/archive.zip

Ensure all the spot requests are active.
The export variables hold the public DNS names of the instances and are used by the automation scripts to place the jars and files.

Master script: exec_prepIP.sh {SpotRequest_File Name}
./exec_prepIp.sh spotrequests.txt ip.txt publicip.txt privateip.txt exportpublicip.txt

Input Arguments:
spotrequests.txt
- Must hold the list of Spot Request IDs, one per line
ip.txt
- Created to hold the list of privateip:publicip pairs
publicip.txt
- Created to hold the list of public IPs
privateip.txt
- Created to hold the list of private IPs
exportpublicip.txt
- Created to hold the export statements for the public IPs

Final Output:
exportpublicip.txt
- Filled with the export statements for future use

The master script in turn calls:
getSpotInstances.sh $spotRequestId
- Prepares ip.txt with internalIP:externalIP
prepPublicPrivateIP.sh
- Reads ip.txt and prepares publicip.txt and privateip.txt
exportPublicIP.sh
- Reads publicip.txt and prepares exportpublicip.txt with export statements

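To make the data flow between the files concrete, here is a rough sketch of what the last two helper steps amount to. It is an illustration only and not the contents of prepPublicPrivateIP.sh or exportPublicIP.sh: the addresses are made up, and it assumes ip.txt uses the privateip:publicip layout described above and that the variables are named m1, m2, and so on, as in Section 3.

```shell
# Work in a scratch directory so nothing real is overwritten
workdir=$(mktemp -d)
cd "$workdir"

# Sample ip.txt in privateip:publicip format (made-up addresses)
cat > ip.txt <<'EOF'
10.0.0.11:203.0.113.11
10.0.0.12:203.0.113.12
10.0.0.13:203.0.113.13
EOF

# Split the pairs into one file per address type
cut -d: -f1 ip.txt > privateip.txt
cut -d: -f2 ip.txt > publicip.txt

# Number the public addresses and emit one export statement per line
awk '{ printf "export m%d=%s\n", NR, $1 }' publicip.txt > exportpublicip.txt
cat exportpublicip.txt
```

Keeping the intermediate files separate mirrors the master script's arguments, so each file can be inspected or reused by later steps.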
3. Install Latest Version of Cloudera Manager:

• Open the Terminal
• Copy the contents of the file “exportpublicip.txt” generated in Section 2
• The contents will be similar to the lines below
export m1=54.226.120.23
export m2=54.197.132.98
export m3=54.234.125.43
export m4=54.242.203.115
export m5=54.237.210.213
export m6=54.80.65.111
export m7=54.205.6.5
export m8=54.197.221.12
export m{N}=54.198.240.198
• Log in to machine $m1 and install Cloudera Manager using the commands below

ssh -i ~/.ssh/asdev.pem -oStrictHostKeyChecking=no ubuntu@$m1
wget http://archive.cloudera.com/cm4/installer/latest/cloudera-manager-installer.bin
chmod +x cloudera-manager-installer.bin
sudo ./cloudera-manager-installer.bin

• Accept the license and continue the installation
• After the installation, log in to Cloudera Manager in the browser
  • http://54.226.120.23:7180 [i.e. $m1:7180]
  • with admin/admin as credentials
• Select Cloudera Standard Edition
• Click Launch Classic Wizard
• Enter the list of private DNS names of all the instances (m1 - m{N})
  • Copy the contents from “privateip.txt” generated in Section 2

ip-10-46-253-27.ec2.internal,ip-10-44-78-201.ec2.internal,ip-10-44-137-72.ec2.internal,ip-10-127-81-209.ec2.internal,ip-10-45-171-56.ec2.internal,ip-10-94-42-39.ec2.internal,ip-10-120-113-81.ec2.internal,ip-10-111-154-10.ec2.internal,ip-10-44-133-206.ec2.internal,ip-10-87-150-185.ec2.internal,ip-10-126-142-60.ec2.internal
• Click Search
• Ensure all the hosts are validated with the green tick
• If the hosts are not able to talk to each other, correct the security group
  • i.e. add a new rule to the security group of the instances
  • The new rule should be of type All TCP, with the security group's own ID (sg-xxxx) as the source
  • This allows the instances to talk to each other
• Click Continue
• Select the parcels to be installed and Continue
• Provide the appropriate SSH details
  • Log in as ubuntu, not root
  • Private key: asdev, as given during instance creation
• Installation will start
• Continue and customise the services to be installed as part of the Hadoop stack
• Use Embedded Database and Test Connection
• Complete the installation

In the dashboard, the health status of the machines should not be Bad.
If it is Bad, free up some space using the script mentioned in the Remove Cache/Archive section.

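Two of the manual copy steps in this section can be done from the shell. The sketch below is illustrative only: it uses stand-in files with made-up values and assumes that privateip.txt holds one private DNS name per line, which may differ from what the Section 2 scripts actually write.

```shell
# Stand-in files with made-up values, in the formats described in Section 2
workdir=$(mktemp -d)
printf 'export m1=203.0.113.11\nexport m2=203.0.113.12\n' > "$workdir/exportpublicip.txt"
printf 'ip-10-0-0-11.ec2.internal\nip-10-0-0-12.ec2.internal\n' > "$workdir/privateip.txt"

# Load m1..mN into the current shell instead of pasting the export lines by hand
. "$workdir/exportpublicip.txt"
echo "Cloudera Manager UI: http://$m1:7180"

# Join the private DNS names into the comma-separated form entered in the wizard
host_list=$(paste -sd, "$workdir/privateip.txt")
echo "$host_list"
```

Sourcing the file with "." keeps the variables in the current shell, which is what the later ssh and scp commands rely on.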
4. Add/Remove Services: [Optional]

To add a service:
• Click Home
• Click on the drop-down to the right of the cluster name
• Rename the cluster if needed
• Click “Add a Service” from the drop-down
• Select Solr
• Select the set of dependencies
• Select one or more hosts for the new service
• Accept and Continue

To remove a service:
• Click on the drop-down next to the service to be deleted
• Click Delete
• If there are dependent services, delete those first and then continue the deletion

5. Preparing Environment:

Place Jars/Files:

• Open the Terminal (on the laptop)
• Export all the machine variables as described in Section 3
• Execute
  sh exec_placeJars.sh publicip.txt placeJars_final
  This script prepares the placeJars_final_{n}.sh scripts and executes them sequentially; they place the utility jars listed below. To save time, the prepared scripts can instead be executed manually in parallel.
• The script moves the jars from
  Source: /home/ubuntu/uploads/jars/dist-lib/*.jar
          /home/ubuntu/uploads/jars/ext/*.jar
          /home/ubuntu/uploads/jars/extraction-lib/*.jar
          /home/ubuntu/uploads/jars/morphlines-cell-lib/*.jar
          /home/ubuntu/uploads/jars/morphlines-core-lib/*.jar
          /home/ubuntu/uploads/jars/solrj-lib/*.jar
          /home/ubuntu/uploads/jars/web-inf-lib/*.jar
  Target: /opt/cloudera/parcels/CDH-4.6.0-1.cdh4.6.0.p0.26/lib/hadoop-0.20-mapreduce/lib/
• The jars are already present on all the machines as part of the AMI {HASOLR-POC - ami-0dc3df64}
• Restart the cluster from the Cloudera Manager UI

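Running the prepared placeJars_final_{n}.sh scripts in parallel, as suggested above, could look like the sketch below. It is an assumption-laden illustration: run_in_parallel is a hypothetical helper, and the two generated scripts are replaced by harmless stand-ins that only print a message.

```shell
# Create two stand-in scripts named like the generated ones
workdir=$(mktemp -d)
for n in 1 2; do
  printf '#!/bin/sh\necho "placing jars on host %s"\n' "$n" \
    > "$workdir/placeJars_final_$n.sh"
done

# Hypothetical helper: start every script in the background, log each one
# to <script>.log, then wait for all of them and report any failure.
run_in_parallel() {
  pids=""
  for script in "$@"; do
    sh "$script" > "$script.log" 2>&1 &
    pids="$pids $!"
  done
  status=0
  for pid in $pids; do
    wait "$pid" || status=1
  done
  return "$status"
}

run_in_parallel "$workdir"/placeJars_final_*.sh
cat "$workdir"/placeJars_final_*.sh.log
```

Writing one log file per script keeps the output of the concurrent runs from interleaving.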
Prepare HDFS:
Deploy MRIndexer Tool:

Use the command below to build the MR Indexer Tool locally and ship the jar to $m1
sh deployMRjar.sh $m1
This script builds the code locally and pushes the jar to /opt/cloudera/parcels/SOLR-1.2.0-1.cdh4.5.0.p0.4/lib/solr/contrib/mr/

Note: Modify the script to reflect your local build path

Preparing Environment for docS3:

Open the Terminal
sh ~/scripts/aws-dev/prepdocS3.sh $m1
This script:
- Downloads the docs3 jar from the AS dev environment to the EC2 instance
- Installs the MySQL server on the EC2 instance (enter “root” as the password when prompted)
- Executes prepareFileList.sh, which prepares the list of S3 object keys (inputFileList.txt) to be processed
Note: Edit mysql.sh so that it queries the accession numbers you need

Prepare Hadoop job folders:

Log in to $m1
Prepare HDFS using the commands below

# Create the user's HDFS home directory and make the user its owner
sudo -u hdfs hadoop fs -mkdir -p /user/$USER
sudo -u hdfs hadoop fs -chown $USER:$USER /user/$USER
# Create the input directory and load the sample files
hadoop fs -mkdir -p /user/$USER/indir
hadoop fs -copyFromLocal ~/uploads/files/samplefiles/* /user/$USER/indir/
hadoop fs -ls /user/$USER/indir
# Recreate an empty output directory
hadoop fs -rm -r -skipTrash /user/$USER/outdir
hadoop fs -mkdir /user/$USER/outdir
hadoop fs -ls /user/$USER/outdir
# Create a top-level /outdir owned by the user
sudo -u hdfs hadoop fs -mkdir /outdir
sudo -u hdfs hadoop fs -chown $USER:$USER /outdir
# Upload the shard configuration files
hadoop fs -put /home/ubuntu/uploads/files/txshards.conf /tmp/
hadoop fs -put /home/ubuntu/uploads/files/fdshards.conf /tmp/
# Open the HDFS site configuration in an editor
nano /etc/hadoop/conf/hdfs-site.xml

With inputFileList:
hadoop --config /etc/hadoop/conf.cloudera.mapreduce1 \
  jar /home/ubuntu/uploads/jars/mrlib/solr-map-reduce-4.7-SNAPSHOT.jar \
  -D 'mapred.child.java.opts=-Xmx8G' \
  --log4j /opt/cloudera/parcels/SOLR-1.2.0-1.cdh4.5.0.p0.4/share/doc/search/examples/solr-nrt/log4j.properties \
  --shards 1 \
  --shardsConf hdfs://ip-10-111-154-10.ec2.internal:8020/tmp/txshards.conf \
  --fulldocshardsConf hdfs://ip-10-111-154-10.ec2.internal:8020/tmp/fdshards.conf \
  --morphline-file /home/ubuntu/uploads/files/readASXML.conf \
  --solr-home-dir /home/ubuntu/uploads/files/TX-collection \
  --TX-solr-home-dir /home/ubuntu/uploads/files/TX-collection \
  --FF-solr-home-dir /home/ubuntu/uploads/files/FF-collection \
  --FD-solr-home-dir /home/ubuntu/uploads/files/FD-collection \
  --output-dir hdfs://ip-10-111-154-10.ec2.internal:8020/user/$USER/outdir_5L \
  --collection collection1 \
  --verbose \
  --input-list /home/ubuntu/uploads/scripts/docS3/inputFileList_5L.txt

# On $m1: fetch the job output from HDFS to the local disk
rm -r /home/ubuntu/downloads/outdir
hadoop fs -get /user/ubuntu/outdir /home/ubuntu/downloads/

# On the laptop: copy the output from $m1
scp -i ~/.ssh/asdev.pem -r ubuntu@$m1:/home/ubuntu/downloads/outdir ~/Downloads/stats/

6. Preparing Environment for creating AMI: [Optional]

Create a spot instance using Ubuntu 12.04 64-bit (ami-59a4a230) with 30 GB of RAM
Install Java:
If the AMI doesn't come with Java 7, follow the steps below to install it
Download the tarballs from http://www.oracle.com/technetwork/java/javase/downloads/java-archive-downloads-javase7-521261.html#jre-7u25-oth-JPR
sudo scp -i ~/.ssh/asdev.pem ~/Downloads/jre-7u25-linux-x64.gz ubuntu@$m1:/home/ubuntu/
sudo scp -i ~/.ssh/asdev.pem ~/Downloads/jdk-7u25-linux-x64.gz ubuntu@$m1:/home/ubuntu/
ssh -i ~/.ssh/asdev.pem ubuntu@$m1 "sudo mkdir /usr/local/java;sudo cp jre-7u25-linux-x64.gz /usr/local/java/;sudo cp jdk-7u25-linux-x64.gz /usr/local/java/"
ssh -i ~/.ssh/asdev.pem ubuntu@$m1
cd /usr/local/java
sudo tar xvzf jdk-7u25-linux-x64.gz
sudo tar xvzf jre-7u25-linux-x64.gz
sudo vi /etc/profile
Add the following lines to /etc/profile:
JAVA_HOME=/usr/local/java/jdk1.7.0_25
PATH=$PATH:$HOME/bin:$JAVA_HOME/bin
JRE_HOME=/usr/local/java/jre1.7.0_25
PATH=$PATH:$HOME/bin:$JRE_HOME/bin
export JAVA_HOME
export JRE_HOME
export PATH

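The Java settings can also be written without an interactive editor. The sketch below shows the idea with a here-document; it writes to a temporary file rather than /etc/profile so that it can be tried safely, and the single combined PATH line is a simplification of the two PATH lines above, not a required form.

```shell
# Write the Java settings to a scratch file (stand-in for /etc/profile)
profile_snippet=$(mktemp)
cat > "$profile_snippet" <<'EOF'
JAVA_HOME=/usr/local/java/jdk1.7.0_25
JRE_HOME=/usr/local/java/jre1.7.0_25
PATH=$PATH:$HOME/bin:$JAVA_HOME/bin:$JRE_HOME/bin
export JAVA_HOME JRE_HOME PATH
EOF

# Load the settings into the current shell and confirm they took effect
. "$profile_snippet"
echo "JAVA_HOME=$JAVA_HOME"
echo "JRE_HOME=$JRE_HOME"
```

The quoted 'EOF' delimiter keeps $PATH and $JAVA_HOME unexpanded in the file, so they are resolved when the file is sourced, as they would be in /etc/profile.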
Ensure Java 7 is installed in one of the folders below. If not, duplicate the “java 7” folder to /usr/lib/jvm/j2sdk1.7-oracle

/usr/lib/j2sdk1.6-sun
/usr/lib/jvm/java-6-sun
/usr/lib/jvm/java-1.6.0-sun-1.6.0.*
/usr/lib/jvm/java-1.6.0-sun-1.6.0.*/jre/
/usr/lib/jvm/j2sdk1.6-oracle
/usr/lib/jvm/j2sdk1.6-oracle/jre
/usr/java/jdk1.6*
/usr/java/jre1.6*
/usr/java/jdk1.7*
/usr/java/jre1.7*
/usr/lib/jvm/j2sdk1.7-oracle
/usr/lib/jvm/j2sdk1.7-oracle/jre
/Library/Java/Home
/usr/java/default
/usr/lib/jvm/default-java
/usr/lib/jvm/java-openjdk
/usr/lib/jvm/jre-openjdk
/usr/lib/jvm/java-1.7.0-openjdk*
/usr/lib/jvm/jn
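Checking which of these folders exists on an instance can be done with a short loop. This is a minimal sketch: first_existing_dir is a hypothetical helper, and only a few of the Java 7 candidates from the list are passed to it.

```shell
# Hypothetical helper: print the first argument that is an existing directory
first_existing_dir() {
  for candidate in "$@"; do
    if [ -d "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# The globs are left unquoted on purpose so the shell expands them;
# a pattern without a match stays literal and simply fails the -d test.
first_existing_dir \
  /usr/lib/jvm/j2sdk1.7-oracle \
  /usr/java/jdk1.7* \
  /usr/lib/jvm/java-1.7.0-openjdk* \
  || echo "no Java 7 folder found among the candidates"
```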
Also place the following folders and files on the instance before creating the AMI
FF-collection
FD-collection
txshards.conf
fdshards.conf
readASXML.conf

Create an image from this instance

Execution Statistics:

Machine Type | Machine Count | Input File Count File Read | Mappers Used | Reducers Used | Start Time                   | End Time                     | Job Id                    | Custom
m2.4xlarge   | 300           | 897,174                    | 3052         | 1148          | Tue Apr 22 21:49:15 UTC 2014 | Wed Apr 23 04:28:03 UTC 2014 | job_201404222147_0001[58] | numLines*6
m2.4xlarge   | 300           | 500,000                    | 2977         | 1148          | Wed Apr 23 06:18:24 UTC 2014 | Wed Apr 23 10:18:03 UTC 2014 | job_201404222147_0059     | numLines*6

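The run durations implied by the start and end times can be worked out with a few lines of shell. This sketch assumes GNU date (for the -d option); the timestamps are the ones from the table, rewritten in ISO form, and the files-per-second figure uses integer division, so it is rounded down.

```shell
# Seconds between two timestamps. Assumes GNU date (the -d option).
elapsed_seconds() {
  start=$(date -u -d "$1" +%s)
  end=$(date -u -d "$2" +%s)
  echo $((end - start))
}

# Format a number of seconds as HH:MM:SS
format_hms() {
  printf '%02d:%02d:%02d\n' $(($1 / 3600)) $(($1 % 3600 / 60)) $(($1 % 60))
}

run1=$(elapsed_seconds '2014-04-22 21:49:15 UTC' '2014-04-23 04:28:03 UTC')
run2=$(elapsed_seconds '2014-04-23 06:18:24 UTC' '2014-04-23 10:18:03 UTC')

echo "Run 1: $(format_hms "$run1"), about $((897174 / run1)) files per second"
echo "Run 2: $(format_hms "$run2"), about $((500000 / run2)) files per second"
```

The first run comes to 06:38:48 and the second to 03:59:39.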
List of Errors Faced:

Error 1:
Reference: http://stackoverflow.com/questions/20687517/cannot-allocate-memory-errno-12-errors-during-runtime-of-java-application

Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x0000000752680000, 167247872, 0) failed; error='Cannot allocate memory' (errno=12)

Error 2:
attempt_201404220829_0001_m_000252_1: Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x0000000609c00000, 255328256, 0) failed; error='Cannot allocate memory' (errno=12)
attempt_201404220829_0001_m_000252_1: #
attempt_201404220829_0001_m_000252_1: # There is insufficient memory for the Java Runtime Environment to continue.
attempt_201404220829_0001_m_000252_1: # Native memory allocation (malloc) failed to allocate 255328256 bytes for committing reserved memory.
attempt_201404220829_0001_m_000252_1: # An error report file with more information is saved as:
attempt_201404220829_0001_m_000252_1: # /mapred/local/taskTracker/ubuntu/jobcache/job_201404220829_0001/attempt_201404220829_0001_m_000252_1/work/hs_err_pid1099.log
java.lang.Throwable: Child Error
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:250)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:237)
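
Both errors report that a JVM could not obtain memory from the operating system. One possible cause, which is an assumption and not something the logs confirm, is that the number of child JVMs running at once on a node, multiplied by the heap size set through mapred.child.java.opts, exceeds the memory of the node. The sketch below does that arithmetic; max_child_tasks is a hypothetical helper, and the 8 GB reserve for the operating system and the Hadoop daemons is an assumed figure.

```shell
# How many child JVMs of a given heap size fit into a node's memory?
# All figures are whole gigabytes; the reserve is an assumed value.
max_child_tasks() {
  node_ram_gb=$1
  reserved_gb=$2
  child_heap_gb=$3
  echo $(( (node_ram_gb - reserved_gb) / child_heap_gb ))
}

# m2.4xlarge offers 68.4 GiB of RAM (rounded down to 68 here);
# the last argument matches the -Xmx8G passed to the job in Section 5.
max_child_tasks 68 8 8
```

If the configured number of map and reduce slots per node is higher than the figure this prints, lowering the slot count or the child heap size would be a reasonable first thing to try.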