Discrete Categorical Random Sampling

Articles —> Discrete Categorical Random Sampling

Random sampling is a process that involves choosing one or more values from an underlying probability distribution. Often used in computer simulation, mathematics, and statistics, the probability distribution defines how frequently a number (or range of numbers) will be chosen. For example, a normal distribution has a center and width determined by the mean and standard deviation, respectively - sampling from a normal distribution results in values around the mean of the distribution being chosen more frequently than those further away from the mean.

The above types of random sampling involves continuous variables - however sampling can also be used to sample discrete values. In this article, the focus is how to sample from a pool of discrete values - the catch being that each word may have a different probability of being selected. A practical example of using this type of sampling might be 'smart' song selection, in which the next random song to play is based upon previously listened music.

One can visually depict this scenario by drawing a line, with each category having a segment on that line, the length of which represents the probability the value will be chosen. To randomly sample from this distribution, a random continuous value can be chosen (for simplicity we'll just use a uniform distribution) between 0 and the length of the line, the sampled discrete value is thus represented by the segment in which this value falls.

Below is a generic Java class demonstrating this technique - each item is first associated with a given value, the probability of that item being chosen being its value divided by the sum of all values. To sample from this discrete distribution, all one needs to do is choose a random number between 0 and the sum of all values, then iterate through all items until the randomly chosen value lies on the segment representing the item:




import java.util.HashMap;

import java.util.Iterator;

import java.util.Map;



/**

 * Class to allow for probability sampling from categories, each with a defined probability of selection. Items are added

 * to the sampler with a given sampling value - either the default supplied - and this value is used to sample

 * from the underlying items in a probabilistic way. The true probability of selection is the defined sampling value

 * of an item divided by the total sampling value of all items. 

 * @author Greg Cope

 *

 * @param <T> 

 */

public class CategoricalSampler<T> {

	

	private Map<T, Double> samplingValues = new HashMap<T, Double>();

	

	private double defaultSamplingValue;

	

	private double totalSamplingValue = 0;

	

	/**

	 * Constructs a new sampler object with a default probability of added items of 0.1

	 */

	public CategoricalSampler(){

		this(0.1);

	}

	

	/**

	 * Constructs a new sampler object with the defined default probability. 

	 * @param defaultSamplingValue

	 */

	public CategoricalSampler(double defaultSamplingValue){

		this.defaultSamplingValue = defaultSamplingValue;

	}

	

	/**

	 * Adds the given parameter item, using the default probability value. 

	 * @param item

	 */

	public void addItem(T item){

		addItem(item, defaultSamplingValue);

	}

	

	/**

	 * Adds the given item to this sampler with the given value. If 

         * the item already exists, it is replaced with the given value

	 * @param item

	 * @param probability

	 */

	public void addItem(T item, double value){

		if ( !samplingValues.containsKey(item) ){

			samplingValues.put(item, value);

			totalSamplingValue += value;

		}else{

			Double d = samplingValues.get(item);

			totalSamplingValue -= d;

			samplingValues.put(item, value);

			totalSamplingValue += value;

		}

	}

	/**

	 * Increments an items value by incrementor. If the value does

         * not exist within this sampler, it is added with incrementor inital value

	 * @param item

	 * @param incrementor

	 */

	public void incrementSamplingValue(T item, double incrementor){

		if ( !samplingValues.containsKey(item) ){

			addItem(item, incrementor);

		}else{

			double val = samplingValues.get(item);

			samplingValues.put(item, val+incrementor);

			totalSamplingValue += incrementor;

		}

	}

	

	/**

	 * Returns the sampling value of the parameter item, or -1 if the item is not part of this CategoricalSampler.

	 * @param item

	 * @return

	 */

	public double getSamplingValue(T item){

		if ( !samplingValues.containsKey(item) ){

			return -1;

		}

		return samplingValues.get(item);

	}

	

	/**

	 * Sets the sampling value of the given item. 

	 * @param item

	 * @param probability

	 */

	public void setSamplingValue(T item, double value){

		addItem(item, value);

	}

	/**

	 * Samples a discrete value from the underlying distribution. This method runs in linear time

	 * @return The sampled item. 

	 */

	public T sample(){

		if ( samplingValues.size() == 0 ){

			throw new IllegalStateException("Empty set");

		}

		double val = Math.random() * totalSamplingValue;

		Iterator<T> items = samplingValues.keySet().iterator();

		double total = 0;

		T returner = null;

		while ( items.hasNext() ){

			T item = items.next();

			double prop = samplingValues.get(item);

			if ( val >= total && val < total + prop ){

				returner = item;

				break;

			}

			total += prop;

		}

		return returner;

	}

	

	@Override

	public String toString(){

		return samplingValues.toString();

	}

	



}



There are no comments on this article.

Back to Articles


© 2008-2017 Greg Cope