tPrasen Mukherjee
editI live in Hyderabad and work for Bing!. I am interested in data mining, mathematics, physics, religion and lots of other random topics.
Resume
editWhat do I do
edit- Clustering in Mobile Ad Network Space
- Extended Fielder Method ( Based on SVD decomposition )
- Stacked RBMs ( aka Restricted Boltzmann Machine )
Earlier Work
editArchitecting Mapquest/Mobile Platform Solutions
edit- Mapquest
- Single Line Geo Search
- Extracting contextual location based on short term browser behavior
- Mood based route recommendation
- WebTileCache
- Single Line Geo Search
- Mobile AIM Platform
- Wireless Village/IMPS
- Open Mobile SDK effort (http://openmobileplt.sourceforge.net/)
- AirMedia Platform
Spearheading R&D initiatives
edit- Leading R&D initiative by collaborating with premier research inistitutes
- Leading open source Machine Learning initiatives (http://lucene.apache.org/mahout)
Ad monetization (Mobile/Social-Networking)
edit- Interact/Influence different business units to aid revenue-generation and monetization
- Apply technology to gauge trends and user-behaviour
- Spot market opportunities
Thoughts on Personalized Recommended Systems
editRecommendation System ( common for Mobile/Desktop)
edit- Recommend (news) articles
- Recommend upcoming events
- Personalized browsing - Idea here is to learn the immediate intent of user dynamically ( in the current session as he broswes more and more pages ) from his past history, and recommend articles based on that.
Temporal profiling of user log data
edit- Analyze distribution of personal topics ( probably extracted via PLSI,LDA etc.) over time
- Use the above distribution data to group temporally similar users
- Reference papers :
- Andrew McCallum's Publications
- A Continuous-Time Model of Topic Co-occurrence Trends. Xuerui Wang, Wei Li, and Andrew McCallum. AAAI Workshop on Event Detection, 2006. Capture the time distributions not only of a topics, but also of their co-occurrences. For example, notice that while NLP and ML have both been around for a long time, but their co-occurrence has been rising recently. The model is effectively a combination of the Pachinko Allocation Model (PAM) and Topics-Over-Time (TOT).
- Topics over Time: A Non-Markov Continuous-Time Model of Topical Trends. Xuerui Wang and Andrew McCallum. Conference on Knowledge Discovery and Data Mining (KDD) 2006. A new LDA-style topic model that models trends over time. The meaning of a topic remains fixed and reliable, but its prevalence over time is captured, and topics may thus focus in on co-occurrence patterns that are time-sensitive. Unlike other work that relies on Markov assumptions or discretization of time, here each topic is associated with a continuous distribution over timestamps. Improvements in topic saliency and the ability to predict time given words.
Personalized Summarization
edit- Can be applied to mobile devices
- Perform document summarization ( snippet-generation ) based on user profile.
- Describe a user based on his past behavior. This requires NLP techniques.
Use Session data to aid unsupervised clustering
edit- Each session could be thought as a representative of a user-intent ( aka topic ) as generally ( there are always exceptions) we do related things in a session. Like searching for a travel destination, followed by searching for hotel rates, followed by searching for air-tickets. If we make this assumption, this greatly reduces the problem of high dimensions.
- Now instead of clustering based on document/word tuple, we could create a representative vector for each user session ( call it a session-vector in term-space ) and apply clustering logic on session/word tuple. We can use LSH to see the distribution of session-vectors across different hash-buckets, which can be used to estimate the aprior number of clusters for latent variable algorithms like PLSI/LDA etc.
ML Algo resources
editGeneric resources
editgenerative Vs discriminative
editLogistic Regression
edit- (ML 15.5) Logistic regression (binary) - computing the gradient
- Logistic Regression Learning, Cost Function, Gradient descent
SVM
edit- One of the best explanations of linear SVM
- Patrick H Winston Lecture 16 : Learning Support Vector Machines. MIT OpenCourseware 6.034 Fall 2010 Artificial Intelligence
- Anoop M Namboodiri IIIT Hyd prof, VC Dimension
- @ 47:15 Use standard convex QP solvers to get lagrange multipliers and then use the equations to get w and b which defines the decision boundary.
- Part 2 of above, explaining kernel trick
- PPTs
- Issues
- Choice of kernel
- Gaussian or polynomial kernel is default
- if ineffective, more elaborate kernels are needed
- domain experts can give assistance in formulating appropriate similarity measures
- Choice of kernel parameters
- e.g. σ in Gaussian kernel
- σ is the distance between closest points with different classifications
- In the absence of reliable criteria, applications rely on the use of a validation set or cross-validation to set such parameters.
- Optimization criterion – Hard margin v.s. Soft margin
- a lengthy series of experiments in which various parameters are tested
- Choice of kernel
Understanding the essence of probability distributions
editUnderstanding Maximum Likelihood
editIts the basis of probability theory, we use it everyday without realizing it. How do we compute probability : #-possible-outcomes/#-total-outcomes. This intuition/axiom/theorem/rule is based on max likelihood. Lets say we have a Bernoulli's event with n-trials and k-positive outcomes. We say probability of positive outcome is k/n. This comes from max-likelihood. Lets see it how.
- Probability of data/likelihood ( L ) = p^k * (1-p)^(n-k)
- Which value of p will maximize L. That is given by dL/dp = 0.
- Solving dL/dp=0 will give same results as solving d(logL)/dp=0.
- Solving d(logL)/dp=0 ==> p = k/n Q.E.D.
Bernoulli type distributions
edit- Relation between Bernoulli/Binomial/Beta
- Binomial is a special case of Bernoulli.
- When n(#-of-trials) = 1, the binomial distribution is a Bernoulli distribution.
- Beta Distribution is continuous extension of Binomial Distribution
- Beta Distribution is also the conjugate prior of Binomial and Bernoulli Distribution in Bayesian statistics.
Multivariate generalizations of above statements
edit- Relation between Multinomial/Dirichlet
- Multinomial distribution is a multivariate generalization of the binomial distribution.
- When k(#-of-outcomes) = 2, the multinomial distribution is the binomial distribution.
- Dirichlet distribution is the conjugate prior of the multinomial in Bayesian statistics.
- Dirichlet distribution is the multivariate generalization of the Beta distribution.
Other important relations between distributions
edit- The binomial distribution converges towards the Poisson distribution as the number of trials goes to infinity while the product np remains fixed.
- The Gamma function generalizes the factorial function for non-integer and complex values of n. If n is a positive integer, then
showing the connection to the factorial function.
- Just as the gamma function for integers describes factorials, the beta function can define a binomial coefficient after adjusting indices:
Why chose Dirichlet
editLets have topic = z, doc = d, word = w.
If we assume P(d|z) is Multinomial i.e.
What the above equation means is that in a document there are n (doc-length) slots to be filled by one of the z topics, ranging from to . P(d|z) is effectively the document distribution and is often referred as just P(d). This is different from p(d) which is the probability of d.
Assuming P(d|z) is a multinomial distribution what are our choices for P(z|d)
- From Bayesian statistics :
- P ( d | z ) * P ( z ) = P ( z | d ) * P ( d )
- P ( z | d ) = P ( d | z ) * P ( z ) / P ( d )
- = Multinomial () * P ( z ) / Constant(i.e. Evidence )
- P ( z | d ) = P ( d | z ) * P ( z ) / P ( d )
Since Dirichlet is a conjugate prior of Multinomial, it is a good choice for P ( z | d ). And since P ( z | d ) is dirichlet, so would be P ( z ) ( from the definition of Conjugate Prior).
CondtionalRandomField (CRF)
edit- Nice Articles
- http://en.wikipedia.org/wiki/Hidden_Markov_model - Pretty good article on HMMs, understandign of which is extremely important to really understand CRF
- http://www.cs.huji.ac.il/course/2004/learns/ConditionalRandomFields2.ppt
- Forward Backward Algorithm
- Comparison with HMM
- In HMM observations ( emission entities ) are independent
- Unlike CR which is based on discriminative model, HMM is based on a generative model, which means its based on joint probability distribution p(s,o) and requires a model of p(o)
- Comparison with MEMM
- http://www3.cs.stonybrook.edu/~ychoi/cse628/lecture/hmm-memm-crf.pdf
- MEMM is a directed graph, means that hidden-states are dependent ONLY on past history
- Like CRF, MEMM is also based on discriminative model, which means its based on conditional probability distribution p(s|s',o) and doesn't require a model of p(o) as observation (o) is already given.
Hidden Markov Models
editEssentially based on hidden variable model which can transition from 1 variable(state) to another with some probability. These states can be interpreted as some known set states which cannot be directly observed, like mood,affinity etc. This approach can be used to model supervised clustering, where clusters are known set of states and features being represented as transmission probabilities (also known as confusion matrix). In analogy with term-doc text mining parlance documents ( or sentences or browsing behaviors ) can be generated by sampling over terms in sequence. The hidden states could be abstract,intro,content,conclusion for documents ( subject,predicate,verb for sentences and start,browse,end for browsing users ).
- An HMM is defined by λ = (M,N,A,B,π), where M=num_observations, N=num_hidden_states, A = NXN self-similarity matrix(state transition probabilities), B = MXN confusion matrix ( state->observable emission probabilities), π = NX1 array of initial hidden state probabilities.
- Basic problems/solutions
- Training ==> Given a set of sequence observations O[i]s estimate HMM parameters by maximum likelihood ( i.e. maximize P(O|λ) ) == Baum Welch Algorithm
- Inferencing ==> Given an observable-sequence find out the most likely state-sequence == Viterbi Algo
LocalitySensitiveHashing (LSH)
editCAVEAT :: These are all my personal observations. Appreciate constructive comments.
Interesting technique to do the following
- Speeds up Clustering
- It avoids pairwise distance/similarity computation
- Near Duplicate Detection
- Tries to solve "Curse of Dimensionality" problem
- As query is independent of corpus size
- More applicable to multimedia indexing
- Unlike text indexing, they follow QueryByExample hence LSH is directly applicable.
- Less applicable to text indexing
- As text-queries are very short ( 1-2 words), hence there will be very few hits as very few documents will have 1-2 words.
Basic LSH Algorithm
edit- Use functions of the form
- Preprocessing:
- Select
- For all , hash p to buckets
- Query
- Retrieve the points from buckets until
- Total points <= 2L or all
- Return top k-points by doing similarity computation among 2L points
- Retrieve the points from buckets until
LSH hash functions
editBasic definition can be found in Locality_sensitive_hashing
Hamming Hash Function
editThis is the first LSH hash funchion mentioned in Indyk-Motwani'98's original paper on LSH. Seems to be more applicable to compute Image Similarity ( Color Histogram match etc )
- Article link : Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality (1998) Indyk-Motwani
- Metric : Hamming distance
- , i.e., the i-th bit of vector v in d dimensional Hamming space, where i=1 to d
Minwise Permutation
editInitial intent was to do similarity computation using some probabilistic technique. Later used as an efficient LSH. Set based approach. All the other approaches are vector/space based. Since its set based unlike vector-space approach, it doesn't take care of the relative distribution of term-frequencies in a document. More applicable to do Clustering, Duplicate Detection etc. in text corpus.
- Article link : Minwise Independent Permutations Broder et al.
- Metric : Jaccard Coefficient
- , where J = Jaccard Coefficient
- Good for text/document similarity
p-Stable Distribution
editUses properties of stable distribution family. Seems to be the most talked about these days. Wide applications including text-retrieval, (sub)image-retrieval, clustering(text,media), duplicate-detection(text,media). Andoni-Indyk-06 extended the projection space from 1 dimension to t-dimension and introduced leech-lattice LSH based on Voronoi diagrammes and leech-lattice-space.
- Good reading articles on p-stable based LSH
- LSH Scheme Based on p-Stable Distributions Indyk et al.
- Applying Hash-based Indexing in Text-based Information Retrieval. Benno Stein and Martin Potthast.
- Metric : Basically it honours norm
- , where
- , and
- denote the prob density of the absolute value of the p-stable distribution
Semantic Hashing
editThis is based on stacked RBMs with layered pre training ( of RBM layers ) followed by backpropagation based training. This seem to work on input data vectors which are quasi binary and can be projected on a hamming space ( means coefficient-values close to 0 or 1 ). Thats one of the problem in modeling a problem with continuous probability distributions into RBM. Hence clustering based on containment or just presence of a click ( instead of click-count or click-probability ) can be modeled using RBMs.
- Nice explanation + Python Code Echen's Implementation
- Richard Weiss's explanation [1]
- Article Link : Semantic Hashing
- Code : Thank's to Benjamin Chung's jaRBM package
MapReduce Variants/Improvisations
editMy SVD Observations
edit- The key to working with SVD of any rectangular matrix A is to consider AAT and ATA.
- The columns of U, that is t by t, are eigenvectors of AAT,
- The columns of V, that is d by d, are eigenvectors of ATA.
- The singular values on the diagonal of S, that is t by d, are the positive square roots of the nonzero eigenvalues of both AAT and ATA.
- Computation of EigenValues/Vectors
- First transform the given matrix into an upper-triangular form ( via HousehOlder reflections/Gram-Schmidt Normalization etc. )
- Then use iterative QR algorithm.
Random Indexing
editRandom Indexing is a vector space technique that provides an efficient and scalable approximation to distributional similarity problems.
Instead of collecting word-use statistics into a 54,000 x 37,600 (M x N) words-by-documents matrix F, collect them into a 54,000 x 4,000 (M x N') matrix G. How? 0. Set G = 0. 1. For each document d, make an N'-dimensional (e.g., 4,000-D) ternary INDEX VECTOR i_d by placing a small number k of -1s and +1s (e.g., ten of each; k = 10) at random among the 4,000; the rest will be 0s: i_d = 0000000+000...000+000000-000...000, that's 4,000 "trits," 3,980 of them 0s. 2. Scan through document d. Every time the word w appears, add the index vector i_d to row w of matrix G. NOTE: The same procedure will produce the matrix F if the 37,600 index vectors i_d are 37,600-bit and have only one 1, in the d-th position (i.e., unary or LOCAL representation): i_d = 00000000000...000+0000000000...000000000000000000000000000, that's 37,600 bits, 37,599 of them 0s.
EC2 Quirks
editGeneral Tricks
edit- Logging into ec2 instance/cluster from a diff machine
- Run ec2-describe-instances and use the public DNS ( aka ec2-x-x-x-x.compute-1.amazonaws.com ) just after the <cluster>-master line.
- ssh -i ~/.ec2_keys/id_rsa-ec2-keypair root@ec2-x-x-x-x.compute-1.amazonaws.com
- scp get
- scp -i ~/.ec2_keys/id_rsa-ec2-keypair root@ec2-x-x-x-x.compute-1.amazonaws.com:/root/log.txt .
- Can use 'cluster ssh'
- Write a script that uses soap to get a list of all running instances and you can use cluster SSH to connect to and issue commands to all machines simultaneously
Use Oracle data mining
editSummarized from [2]
- Use Elasticfox and the following ami to boot instance : “ami-117b9778”
- Enable Local Port-Forwarding
- Get ec2-host's local ip address ( using ifconfig ), then run the following ssh-command
- ssh -4 -g -L <local-port-12345>:<local-ip-X.X.X.X>:1521 -i <ec2-keypair-file> root@<ec2-XX-XX-XX-XX.compute-1.amazonaws.com>
- To Use SQLPLUS
- ssh -i <ec2-keypair-file> root@<ec2-XX-XX-XX-XX.compute-1.amazonaws.com>
- su - oracle
- sqlplus dmuser/dmuser
- To Use OracleDataMiner GUI
- Download Oracle's odminer-GUI tool for Data Mining
- <unzip-dir>/odminer/bin/odminer
- Use following odminer settings :
- user:dmuser/pwd:dmuser
- host:localhost/port:<local-port-12345>/sid:odmdb
Regression Analysis
editBasic idea is based on OLS ( ordinary least squares fit ) of straight line. Extended to more than 1 dependent variables ( regressors ). Very applicable to inventory estimation/forecasting problem for any ad-network mobile networks specifically, since the inventories explode in mobile space because of device*carrier*geo aspects. For desktops device*carrier doesn't matter and geo almost remains the same for a particular inventory cell.
Linear Regression
edit- Simple_linear_regression is the uni-variate form
- Linear_regression is a multi-variate extension of the above
- Problems with multiple (linear) regression
- Not local ==> two completely independent (orthogonal) regressors affecting each other
- Problems with multiple (linear) regression
Scaling the Multiple (Linear) Regression Problem
edit- Exploit hierarchy in regressing dimensions
- Dance between being too macro and too micro
- Too micro ==> unstable regression
- Exhibit random behavior ( central limit theorem )
- Too macro ==> bad estimation
- Violates locality ( as well collapse the dimensions )
- Too micro ==> unstable regression
- Use LSH
- Control the dance via parameterization of hash functions
TV Advertising
edit- Can mix contextual with behavioral, based on currently-watching-pgm and past history with appropriate weights given to each of the algorithms.
- Can use user's EPG settings to pre-fetch ads for contextual ads, as we know what is the pgm user is gonna watch.
- Use CRF models as compared to Markovian models as we know the future behavior here.
Client based (STB) recommendation
edit- Needs to cluster different user-groups in the household
- TV is shared, hence superimposition of behaviors. ( kids, family, teens, parental, senior citizen, young-adult etc. )
- Identify current user-group ( or a mixture of groups )
- Based on temporal, contextual current data
- Based on current EPG settings ( user1,user2 etc. ), channels watching, frequency of activity, time-of-day etc.
- Different user-groups has different cut-off bar for showing recommendations/advertisements :
- For Kids : low bar ==> ( low precision, high recall ). Can show it to anyone. may not be effective for some.
- For Adults : high bar ==> dangerous to show adult advertisements to kids ==>( Extremely High precision, high recall )
- For finance/home-loan ==> moderate bar ( as everyone is moderately interested in finance etc. Low returns when showing it to kids though )
Bidirectional HMM
edit- Can model as bi-directional HMM
- STB sees the ads sent as observations
- Cable Network sees yes/no from STB as observations
OpenCV building guide
edit- Go to OpenCV Installation Guide
- Download source code for linux/mac
- Install Cmake 2.8+
- Download cmake source code
- Building instructions : ( basically ./bootstrap; make ; make install )
- Add /opt/cmake/bin/cmake in your $PATH ( thast where cmake is installed by default in linux ). In mac
- Install pkg-config
- in linux : "yum install pkgconfig"
- in Mac
- Download source code
- Instructions are in INSTALL fiel which comes with the src code ( steps : ./configure; make; make check; make install )
- Configure OpenCV by running cmake
- cd opencvdir; mkdir release; cd release;
- In redhat/linux : cmake -D CMAKE_BUILD_TYPE=RELEASE -D CMAKE_INSTALL_PREFIX=/usr/local -D BUILD_NEW_PYTHON_SUPPORT=ON ..
- In Mac : cmake -D CMAKE_BUILD_TYPE=RELEASE -D CMAKE_INSTALL_PREFIX=/usr/local -D BUILD_NEW_PYTHON_SUPPORT=ON -D BUILD_JAVA_SUPPORT=FALSE ..
- sudo make install
- For Python support
- I was able to compile successfully for python in Mac ( redhat has some issues I am still debugging )
- In mac ensure that you build with -D BUILD_NEW_PYTHON_SUPPORT=ON option while running cmake.
- after running "make install" , export PYTHONPATH=opencvdir/release/lib:opencvdir/samples/python:$PYTHONPATH ( there should be a cv2.so file under <opencvdir>/release/lib )
- now run python; import cv2;
Random Scratchpad
edit- Installations on python
- Use pip ( via easy_install )
- curl -O http://python-distribute.org/distribute_setup.py
- python distribute_setup.py
- sudo python distribute_setup.py
- sudo easy_install -v pip
- pip search libxml2; pip install lxml
- sudo pip install sympy==0.7.1
- Use easy_install ( instructions in http://pypi.python.org/pypi/setuptools)
- curl -O "http://pypi.python.org/packages/2.7/s/setuptools/setuptools-0.6c11-py2.7.egg#md5=fe1f997bc722265116870bc7919059ea"
- mv setuptools-0.6c11-py2.7.egg#md5=fe1f997bc722265116870bc7919059ea setuptools-0.6c11-py2.7.egg
- sh setuptools-0.6c11-py2.7.egg
- sudo easy_install -v lxml
- sudo easy_install -v simplejson
- Use pip ( via easy_install )
- Passwordless SSH/SCP logins
- To summarize, a personal private(.ssh/id_rsa)/public(.ssh/id_rsa.pub) key pair is generated using the ssh-keygen command. The public key is then copied onto a remote systems' .ssh/authorized_keys file. And you can now SSH to the remote systems's account without the use of a password. SSH uses client's .ssh/id_rsa as the priavte key and sends auth request to server which then checks its authorized_keys entries. If present it allows pwdless login
- On Client machine :
- ssh-keygen -t rsa
- scp .ssh/id_rsa.pub prasen@remotemachine:/homes/prasen/.ssh
- ssh prasen@remotemachine 'cat /homes/prasen/.ssh/id_rsa.pub >> /homes/prasen/.ssh/authorized_keys'
- rm .ssh/id_rsa.pub ( Optional )
Mahout QuickstartGuide
edit- svn co http://svn.apache.org/repos/asf/mahout/trunk
- cd trunk; mvn -s /dev/null -P fastinstall ( to avoid your custom settings.xml )
- Install and run hadoop in pdeudo-dist mode
- Run Clustering of synthetic control data
- curl -o synthetic_control.data https://cwiki.apache.org/MAHOUT/clustering-of-synthetic-control-data.html;
- hadoop fs -mkdir testdata; hadoop fs -put <dir>/syncthetic_control.data testdata
- export HADOOP_CONF_DIR=~/work/hadoop/hadoop-0.20.203.0/conf;export HADOOP_HOME=~/work/hadoop/hadoop-0.20.203.0
- bin/mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job
- Step-By-Step-Tutorial links:
MySql on Mac
edit- Very helpful tips
- https://trac.macports.org/wiki/howto/MAMP
- sudo chown -R mysql:mysql /opt/local/var/db/mysql5/
- sudo chown -R mysql:mysql /opt/local/var/run/mysql5/
- sudo chown -R mysql:mysql /opt/local/var/log/mysql5/
- http://hintsforums.macworld.com/showthread.php?p=616669
- https://trac.macports.org/wiki/howto/MAMP
GNUPLOT Tips
edit- Use Gnuplot in Mac
- Install Macports ( macport installs packages in /opt/local/bin )
- sudo port install gnuplot ( gnuplot installed in /opt/local/bin/gnuplot )
- The default aquaterm doesnt work in Mac 10.6
- xterm;gnuplot;set term xterm; plot 'dat' u 1:2 smooth cumulative; ( to see it on xterm )
- set term png;set output 'output.png'; plot 'dat' u 1:2 smooth cumulative; ( this will generate png file )
- Additional gnuplot tips
- set datafile separator '\t';
- Ranked Data plotting
- plot [][26:1] "data" using 2:0:ytic(1) ( http://www.manning.com/janert/SampleCh13.pdf )
- plot [1:100] '/tmp/nonus_sr.txt' u 3:8, '/tmp/us_sr.txt' u 3:8, '/tmp/nonus_hvr.txt'u 3:8, '/tmp/us_hvr.txt' u 3:8 ;
- Useful Links :
Solr Tips
edit- Simple Solr Setup
- cd $solr/example/solr
- ..\..\bin\solr start -f -V
- Solr Setup in a new Directory
- cp -r $solr/example/solr new_solr/.
- rm -rf new_solr/solr/collection1/data/*
- <solr-install-dir>\bin\solr start -f -V -s <full-path>\new_solr\solr
- Solr Multicore setup ( with sharding )
- cd $solr/example; mkdir -p my-multicore;
- create the folders in my-multicore: ./data/core0, ./data/core1, ./sharedInstanceDir
- cp -r multicore/core0/conf my-multicore/sharedInstanceDir
- create the solr.xml in my-multicore
<cores adminPath="/admin/cores"> <core name="core0" instanceDir="$solr/example/my-multicore/sharedInstanceDir" dataDir="$solr/example/my-multicore/data/core0"/> <core name="core1" instanceDir="$solr/example/my-multicore/sharedInstanceDir" dataDir="$solr/example/my-multicore/data/core1"/> </cores>
- java -Dsolr.solr.home=. -jar ../start.jar
- Search Link
- ( simple OR search on a specific field ) curl 'http://localhost:8983/solr/select?rows=2&fl=id' --data-urlencode 'q={!q.op=OR df=features_txt} 6173380581647087986'
- ( searching over all cores ) http://localhost:8983/solr/core0/select?shards=localhost:8983/solr/core0,localhost:8983/solr/core1&q=*:*
- Update Link ( separate for each core ) :
- curl http://localhost:8983/solr/core0/update/csv?commit=true --data-binary @/tmp/tmp/tmp.txt.0.0 -H "Content-type:text/plain; charset=utf-8"
- curl http://localhost:8983/solr/core1/update/csv?commit=true --data-binary @/tmp/tmp/tmp.txt.0.0 -H "Content-type:text/plain; charset=utf-8"
- Populating solr with a csv file
- Make any csv file solr-compatible by adding id to the beginning of each line : nl csv.txt > solrcsv.txt ( It uses TAB as the default separator, use -s ',' for comma )
- curl "http://localhost:8983/solr/update/csv?commit=true&fieldnames=id,sec_t,slk_t,pos_i,country_t,fct_t,wlpr_t,dac_t,count_l&separator=%09" -H "Content-type:text/plain" --data-binary @solrcsv.txt
- wget -O - -v "http://localhost:8983/solr/update/csv?commit=true&fieldnames=id,name_txt,count_l,postcodes_t,bb_t,cg_t,imageryurl_t,zoomLevel_t,polygon_t&separator=%09" --header="Content-type:text/plain" --post-file=solr1.csv
- Verify content by
- Use of Dynamic fields
- Its mentioned in <solr-home>/conf/schema.xml and you can use it as myfield1_s for string, myfield2_i for int etc.
- Documentation in http://wiki.apache.org/solr/SchemaXml#Dynamic_fields Use it along with csvupdate. Very handy.
- Delete Link ( separate for each core, also different ways ) :
- curl http://localhost:8983/solr/update -d 'stream.body=<delete><query>*:*</query></delete>&commit=true'
- http://localhost:8983/solr/core0/update?stream.body=<delete><query>*:*</query></delete>&commit=true
- http://localhost:8983/solr/core1/update?stream.body=<delete><query>*:*</query></delete>&commit=true
B-trees Vs LSM trees Vs Fractal trees ( LevelDB/tokutekDB/Cassandra)
edit- Look at slides around 23. It merges LSM ideas with B-trees by having SStables/Message-buffers at each B-tree node. http://www.tokutek.com/wp-content/uploads/2012/01/20120109-boston-mysql-usergroup_posted.pdf
- One of the cleanest explanation of how SSTables/LevelDBs merge ( compaction algo )
- Writes/Adds are slow in B-tree, but fast with SSTables ( implementation of LSM tree )
- Reads are slow with LSM as reads spread over different SSTables and are all linear ( no binary searches ). Fractal trees try to make reads fast. http://en.wikipedia.org/wiki/Fractional_cascading
- Why are the writes of Cassandra executed faster than MySQL’s. InnoDB is using B+trees, while Cassandra is using LSM trees (BigTable is using LSM trees too).
- http://nosql.mypopescu.com/post/3063887666/writes-performance-b-tree-lsm-tree-fractal-tree
Cassandra Gyaan
editGit_SVN
editBasic Commands
edit- Git Pulls
- git pull ( to fetch from master )
- git pull origin <pmukherj_branch> ( to fetch from origin/remote of a different branch)
- Git Branch
- git branch ( to see current branches, and also your current branch )
- git branch -a ( to see ALL remote and local branches )
- git checkout <pmukherj_scratch>
- Git status
- git status
Python
editInstallation Tips
edit- Use http proxy in python
- urllib2.urlopen('http://google.com', proxies={'http':'127.0.0.1'})
- Printing stacktrace in Python
- import traceback
- try: <block>
- catch Excpetion, err:
- traceback.print_exc(file=sys.stderr)
- Pre requisites In WIndows
- Download and install easy_intsall
- Run 'C:\Python27\Scripts\easy_install pip' from commandline
- Install Mingw32 by downloading the executable
- Add C:\Mingw32\bin to PATH ( set PATH=%PATH%;C:\Mingw32\bin )
- Adding M9ingw32 will fix lotsof python setup/installation problems..
Helpful Coding Tricks
edit- Profiling in Python
- time python -m cProfile <your_python_code.py>
- One liner tricks
- Print last 4 fields of a TSV file
- python -c "import sys; tmp = lambda x: sys.stdout.write(x.split('\t')[:4]); map(tmp,sys.stdin)"
- Print last 4 fields of a TSV file
Python Data Sciences ToolKit
edit- Install Tools
- set HTTP_PROXY=<host:port> ( if proxy exists )
- First approach
- easy_install numpy
- pip3 install numpy
- Alternative( easy ) approach, if the above doesnt work on windows
- Download pre-compiled package numpy-1.10.4+mkl-cp27-cp27m-win32.whl
- Choose the correct file compatible to your CPU and Python version ( e.g. numpy-1.9.2+mkl-cp33-none-win_amd64.whl for python 3.3 and 64 bit windows )
- pip3 install <path-to-file-numpy-1.9.2+mkl-cp33-none-win_amd64.whl>
- Download pre-compiled package numpy-1.10.4+mkl-cp27-cp27m-win32.whl
- Essenstial liraries to install
- pandas ( pip3 install pandas )
- numpy
- scipy
- scikit-learn
- Install Theano/Keras
- Install Python2.7.11
- Install numpy ( for 32 bit )
- install scipy
- git clone https://github.com/Theano/Theano.git
- cd Theano
- python setup.py develop
- git clone https://github.com/fchollet/keras
- cd keras
- python setpu.py install
- Theano/Keras Documentation
- http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/Theano%20DNN.pdf Shorter tutorial
- http://deeplearning.net/software/theano/theano.pdf 500+ page manual
- SVM model generation from csv
- Python code
Naive Bayesian ( 2 packages )
edit- Naive Bayesian scikit
- In Mac ( Because pip installs in /Library/Python/2.6/site-packages and native python uses /System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python]
- sudo port install py26-scikits-learn
- /opt/local/bin/python2.6
- Document classifier code ( 100 features/words, 6 samples)
- In Redhat
- pip install --upgrade numpy
- pip install scipy
- pip install scikit
- In Mac ( Because pip installs in /Library/Python/2.6/site-packages and native python uses /System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python]
Pig/Python Tricks
edit- Using Python one liners as streams
- DEFINE GET_QUERY `python -c "import urlparse, sys; f1 = lambda l:urlparse.parse_qs(l)['r'][0]; f2=lambda u:urlparse.parse_qs(u.split('http://www.flickr.com/search/?')[1])['q'][0];[sys.stdout.write(f2(f1(l))) for l in sys.stdin]"`;
- r1 = stream r through GET_QUERY;
- Testing :
- echo '/f?s=792600186&t=03bed0e95c1ba45ed897cbcc87f754af&r=http%3A%2F%2Fwww.flickr.com%2Fsearch%2F%3Fq%3Dsix%2Bflags%2Bmagic%2Bmountain%26d%3Dtaken-20040101-20051231%26ct%3D0%26mt%3Dall%26adv%3D1&fl_ev=0&lang=en&intl=us&sview=t&sscope=all&spage=9&sadv=t&sver=2.0' | python -c "import urlparse, sys; f1 = lambda l:urlparse.parse_qs(l)['r'][0];f2=lambda u:urlparse.parse_qs(u.split('http://www.flickr.com/search/?')[1])['q'][0];[sys.stdout.write(f2(f1(l))) for l in sys.stdin]"
- Using Python files as streams ( more readable/can print stacktrace etc, but requires an additional file )
- DEFINE GET_QUERY `python test.py`; ( test.py is below)
- import sys
- for l in sys.stdin: print l
- r1 = stream r through GET_QUERY;
- Testing :
- echo '/f?s=792600186&t=03bed0e95c1ba45ed897cbcc87f754af&r=http%3A%2F%2Fwww.flickr.com%2Fsearch%2F%3Fq%3Dsix%2Bflags%2Bmagic%2Bmountain%26d%3Dtaken-20040101-20051231%26ct%3D0%26mt%3Dall%26adv%3D1&fl_ev=0&lang=en&intl=us&sview=t&sscope=all&spage=9&sadv=t&sver=2.0' | python test.py
- DEFINE GET_QUERY `python test.py`; ( test.py is below)
Merkle/Hash Trees, Gossip Protocol, Anti-Entropy/Read Repair, Hinted handoff
edit- Anti-entropy, or replica synchronization, is the mechanism in Cassandra for ensuring that data on different nodes is updated to the newest version.
- Here’s how it works. During a major compaction (see Compaction), the server initiates a TreeRequest/TreeResponse conversation to exchange Merkle trees with neighboring nodes. The Merkle tree is a hash representing the data in that column family. If the trees from the different nodes don’t match, then they have to be reconciled (or “repaired”) in order to determine the latest data values they should all be set to.
- In Dynamo, they use a Merkle tree for anti-entropy (see Merkle Tree). Cassandra does too, but the implementation is a little different. In Cassandra, each column family has its own Merkle tree; the tree is created as a snapshot during a major compaction operation, and it is kept only as long as is required to send it to the neighboring nodes on the ring. The advantage of this implementation is that it reduces disk I/O.
- Cassandra uses Bloom filters to reduce disk access, which can be expensive, on key lookups. Every SSTable has an associated Bloom filter; when a query is performed, the Bloom filter is checked first before accessing disk. Bloom filters CANNOT have false negatives,
- Gossip protocol is used in consistent hashing to slowly move repartitioned data from old shard to new shard ( combining read-repair techniques )
Excel Tricks
edit- Create GUID in Excel
- =CONCATENATE(DEC2HEX(RANDBETWEEN(0,4294967295),8),"-",DEC2HEX(RANDBETWEEN(0,42949),4),"-",DEC2HEX(RANDBETWEEN(0,42949),4),"-",DEC2HEX(RANDBETWEEN(0,42949),4),"-",DEC2HEX(RANDBETWEEN(0,4294967295),8),DEC2HEX(RANDBETWEEN(0,42949),4))
Shell Tricks
edit- Set JAVA_HOME in mac : export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Home
- for i in `ls *.jpg`; do /usr/bin/time -o t curl -s -o /dev/null "http://localhost:10080/getsimilartojpeg?numresults=1&filename=/home/prasen/SmartVision/data/books_image/$i"; done
- time time curl "http://localhost:10080/getsimilartojpeg?numresults=1&filename=/home/prasen/SmartVision/data/books_image/1111111-M.jpg
r = load 'flickr_dump'; r = foreach r generate $0; r = limit r 1000; DEFINE GET_QUERY `python -c "import urlparse, sys; f1 = lambda l:urlparse.parse_qs(l)['r'][0];f2=lambda u:urlparse.parse_qs(u.split('http://www.flickr.com/search/?')[1])['q'][0];[sys.stdout.write(f2(f1(l))) for l in sys.stdin]"`; r1 = stream r through GET_QUERY; STORE r1 INTO '/tmp/save.txt';
Important References
editSpatial Data Mining
editThe Data Mining Trinity Tasks
edit- classification
- for prediction, thematic classification
- spatial regression ( SAR )
- Markov Random field
- clustering
- Identify hotspots, outliers
- DBSCAN, hierarchical,
- association rules
- for relationship discover a --> b ==> X --> Y
Spatial Measures
edit- Mean, Dispersion
- Spatial Dependence. Moran's I A global measure of spatial autocorrelation.
Spatial Statistical Models
edit- Point Process
- Lattices
- Geostatistics. To model spatial variability using variograms. Variograms summarizes the relationship between differences in pairs of measurements and the distance between corresponding pairs. Krging is another estimation procesdure used in geostatistics.
Data Science Interview Questions
edit- Explain hypothesis testing and what is a p-value
- What do you mean by a ML model. What exactly is learned and how. Explain with an example.
- What is Maximum log likelihood and where it is used
- Estimation of Pi using Monte Carlo SImulation
- Probability of 3 ants colliding, starting from 3 corners of an equilateral triangle ( 2/8 )
- Probability of an element present in a sample of size r out of a total s elements
- K samples from an infinite stream
- Interesting Algorithmic problems
- https://medium.com/rants-on-machine-learning
- https://www.mapr.com/blog/some-important-streaming-algorithms-you-should-know-about
- Hashing and Sketching
- Hyper log log = count distinct
- Count min = count(s)
- Streaming k-means
- Quantiles via t-digest