Prof. Chandrajit Bajaj from University of Texas at Austin gave a talk at 1pm on May 29, 2015 on Low Discrepancy Samplings of High Dimensional Geometric Spaces with Applications.

This talk is hosted by Prof. David Mount, who I took his CMSC 754 Computation Geometry class.

Misc: In 1984, Prof. Mount visited Cornell, Prof. Bajaj took Prof. Mount to try Indian food; Prof. Bajaj played tennis very well 🙂

Abstract

Generating nearly uniform random point samples from geometric spaces is fundamental to several applications in the computational and data sciences. One natural measure of uniform sampling quality is discrepancy. In this talk I shall describe deterministic methods of constructing low discrepancy samplings of geometric spaces, with particular emphasis on motion groups. These include a deterministic method of constructing an m point set sampling of the rotation group SO(3) with discrepancy O((log^{2/3} m)/m^{1/3}) against collections of local convex sets (suitably defined under the Hopf Fibration). We then extend this construction to get an almost exponential improvement in size (from the trivial tensor product) of low discrepancy samplings in SO(3)^n. Using low discrepancy sets of size m for SO(3) we construct a collection of (mn/eps)^{O(log log m+(log log 1/eps)(log log log 1/eps))} points (as opposed to O(mn) size) and with discrepancy eps against the class of combinatorial rectangles. These low discrepancy samplings of product spaces coupled to non-equispaced SO(3) Fourier transforms provides efficient space-time and bounded error complexity solutions for high dimensional non-convex geometric optimization, numerical integration and uncertainty quantification. Example applications of this methodology shall be drawn from molecular bio-informatics.

Notes

Here is a some good resource to get a quick background overview about Low Discrepancy Sampling: http://planning.cs.uiuc.edu/node207.html: In some applications, selecting points that align with the coordinate axis may be undesirable. Therefore, extensive sampling theory has been developed to determine methods that avoid alignments while distributing the points uniformly. In sampling-based motion planning, grids sometimes yield unexpected behavior because a row of points may align nicely with a corridor in ${\cal C}_{free}$. In some cases, a solution is obtained with surprisingly few samples, and in others, too many samples are necessary. These alignment problems, when they exist, generally drive the variance higher in computation times because it is difficult to predict when they will help or hurt. This provides motivation for developing sampling techniques that try to reduce this sensitivity.

Let x be a measure space, let R be a collection of subsets of x that is called a range space. In most cases, R is chosen as the set of all axis-aligned rectangular subsets. With respect to a particular point set, P, and range space, R, the discrepancy for k samples is defined as $D(P,R) = sup { \frac{|P \cap R|}{k} – \frac{\mu(R)}{\mu(X)} }$

Discrepancy measures whether the right number of points fall into boxes. It is related to the chi-square test but optimizes over all possible boxes.

First research question: Given a model of protein, can we predict whether it fits to other organisms?

• Modeling Molecular Interactions (Reference: http://ww2.chemistry.gatech.edu/~lw26/structure/molecular_interactions/mol_int.html)
• Given two molecules, search for the best relative transformation and relative conformation that yields a complex with the minimum binding free energy
• Docking Search is based on an adaptively sampled Search space (Translation + Rotation + Flexibility)
• Docking Scoring is based on \DeltaE = E_{A+B} – (E_A + E_B)
• How to minimize that Docking Scoring energy?
• The goal is Max_{t,x} F_{A,B}^{SC}(t,r) = \in_R f_A^{SC}(x) T_x(\cdots)
• skin-skin overlap score corresponds to the real part of
• Molecular conformations: Flexibility
• Domain decomposition
• Motion graph
• Protein folding problem (Reference: http://www.sciencemag.org/content/338/6110/1042)
• The protein-folding problem was first posed about one half-century ago. The term refers to three broad questions: (i) What is the physical code by which an amino acid sequence dictates a protein’s native structure? (ii) How can proteins fold so fast? (iii) Can we devise a computer algorithm to predict protein structures from their sequences? We review progress on these problems. In a few cases, computer simulations of the physical forces in chemically detailed models have now achieved the accurate folding of small proteins. We have learned that proteins fold rapidly because random thermal motions cause conformational changes leading energetically downhill toward the native structure, a principle that is captured in funnel-shaped energy landscapes. And thanks in part to the large Protein Data Bank of known structures, predicting protein structures is now far more successful than was thought possible in the early days. What began as three questions of basic science one half-century ago has now grown into the full-fledged research field of protein physical science.
• Predicting Molecular Recognition (Docking)
• Given a pari of protein structures search over the flexible configurations of each to optimize the binding affinity of the complex

• High Dimensional Search Spaces
• Space of all possible composite transformations
• In general, the space of transformations is R^6 \times R^{3l} = R^{3l+6}
• With $n$ molecules, this means the search space is R^{O(nl)}
• Special cases
• Symmetry: the degree of freedom (DoF) of transformations is reduced based on the symmetry group
• Rigid: I = 0
• 3D rigid protein docking
• 6 degrees of freedom for motion
• rotation matrix
• translation vector
• non-convex optimization -> global search
• Affinity Based Scoring Optimization
• Fast Rotational Matching
• Rotational Invariance
• Sampling of Motion Groups (Manifold)
• T(k), the group of translations in k dimensions
• Docking / Assembly Prediction: Scoring functions
• According to Wikipedia: The thermodynamic free energy is the amount of work that a thermodynamic system can perform. The concept is useful in the thermodynamics of chemical or thermal processes in engineering and science. The free energy is the internal energy of a system minus the amount of energy that cannot be used to perform work. This unusable energy is given by the entropy of a system multiplied by the temperature of the system. Like the internal energy, the free energy is a thermodynamic state function. Energy is a generalization of free energy, since energy is the ability to do work which is actually free energy.
• Typical energetic scoring functions are non-convex
• Poisson-Boltzmann Equation
• The Poisson–Boltzmann equation is a useful equation in many settings, whether it be to understand physiological interfaces, polymer science, electron interactions in a semiconductor, or more. It aims to describe the distribution of the electric potential in solution in the direction normal to a charged surface. This distribution is important to determine how the electrostatic interactions will affect the molecules in solution. From the Poisson–Boltzmann equation many other equations have been derived with a number of different assumptions.
• Boundary Integral Equation Reformulation
• Monte Carlo Sampling (refer to Prof. David Mount’s note)
• $error = \frac{\delta(f)}{\sqrt(N)}$
• Quasi Monte Carlo
• Koksma-Hlawka inequality
• Importance Sampling
• Representations of SO(3)
• Mobius Band
• Hopf Fibration
• Discrepancy and Dispersion

Note taker: Ruofei Du

Paper Reference:

1411.7753v1