Analyzing user feedback of on-line communities
- The economic success of the World Wide Web makes it a highly competitive environment for web businesses. For this reason, it is crucial for web business owners to learn what their customers want. This thesis provides a conceptual framework and an implementation of a system that helps to better understand the behavior and potential interests of web site visitors by accounting for both explicit and implicit feedback. This thesis is divided into two parts.
The first part is rooted in computer science and information systems and uses graph theory and an extended click-stream analysis to define a framework and a system tool that is useful for analyzing web user behavior by calculating the interests of the users.
The second part is rooted in behavioral economics, mathematics, and psychology and is investigating influencing factors on different types of web user choices. In detail, a model for the cognitive process of rating products on the Web is defined and an importance hierarchy of the influencing factors is discovered.
Both parts make use of techniques from a variety of research fields and, therefore, contribute to the area of Web Science.
An erasure-resilient and compute-efficient coding scheme for storage applications
- Driven by rapid technological advancements, the amount of data that is created, captured, communicated, and stored worldwide has grown exponentially over the past decades. Along with this development it has become critical for many disciplines of science and business to being able to gather and analyze large amounts of data. The sheer volume of the data often exceeds the capabilities of classical storage systems, with the result that current large-scale storage systems are highly distributed and are comprised of a high number of individual storage components. As with any other electronic device, the reliability of storage hardware is governed by certain probability distributions, which in turn are influenced by the physical processes utilized to store the information. The traditional way to deal with the inherent unreliability of combined storage systems is to replicate the data several times. Another popular approach to achieve failure tolerance is to calculate the block-wise parity in one or more dimensions. With better understanding of the different failure modes of storage components, it has become evident that sophisticated high-level error detection and correction techniques are indispensable for the ever-growing distributed systems. The utilization of powerful cyclic error-correcting codes, however, comes with a high computational penalty, since the required operations over finite fields do not map very well onto current commodity processors. This thesis introduces a versatile coding scheme with fully adjustable fault-tolerance that is tailored specifically to modern processor architectures. To reduce stress on the memory subsystem the conventional table-based algorithm for multiplication over finite fields has been replaced with a polynomial version. This arithmetically intense algorithm is better suited to the wide SIMD units of the currently available general purpose processors, but also displays significant benefits when used with modern many-core accelerator devices (for instance the popular general purpose graphics processing units). A CPU implementation using SSE and a GPU version using CUDA are presented. The performance of the multiplication depends on the distribution of the polynomial coefficients in the finite field elements. This property has been used to create suitable matrices that generate a linear systematic erasure-correcting code which shows a significantly increased multiplication performance for the relevant matrix elements. Several approaches to obtain the optimized generator matrices are elaborated and their implications are discussed. A Monte-Carlo-based construction method allows it to influence the specific shape of the generator matrices and thus to adapt them to special storage and archiving workloads. Extensive benchmarks on CPU and GPU demonstrate the superior performance and the future application scenarios of this novel erasure-resilient coding scheme.
Design of competitive paging algorithms with good behaviour in practice
- Paging is one of the most prominent problems in the field of online algorithms. We have to serve a sequence of page requests using a cache that can hold up to k pages. If the currently requested page is in cache we have a cache hit, otherwise we say that a cache miss occurs, and the requested page needs to be loaded into the cache. The goal is to minimize the number of cache misses by providing a good page-replacement strategy. This problem is part of memory-management when data is stored in a two-level memory hierarchy, more precisely a small and fast memory (cache) and a slow but large memory (disk). The most important application area is the virtual memory management of operating systems. Accessed pages are either already in the RAM or need to be loaded from the hard disk into the RAM using expensive I/O. The time needed to access the RAM is insignificant compared to an I/O operation which takes several milliseconds.
The traditional evaluation framework for online algorithms is competitive analysis where the online algorithm is compared to the optimal offline solution. A shortcoming of competitive analysis consists of its too pessimistic worst-case guarantees. For example LRU has a theoretical competitive ratio of k but in practice this ratio rarely exceeds the value 4.
Reducing the gap between theory and practice has been a hot research issue during the last years. More recent evaluation models have been used to prove that LRU is an optimal online algorithm or part of a class of optimal algorithms respectively, which was motivated by the assumption that LRU is one of the best algorithms in practice. Most of the newer models make LRU-friendly assumptions regarding the input, thus not leaving much room for new algorithms.
Only few works in the field of online paging have introduced new algorithms which can compete with LRU as regards the small number of cache misses.
In the first part of this thesis we study strongly competitive randomized paging algorithms, i.e. algorithms with optimal competitive guarantees. Although the tight bound for the competitive ratio has been known for decades, current algorithms matching this bound are complex and have high running times and memory requirements. We propose the algorithm OnlineMin which processes a page request in O(log k/log log k) time in the worst case. The best previously known solution requires O(k^2) time.
Usually the memory requirement of a paging algorithm is measured by the maximum number of pages that the algorithm keeps track of. Any algorithm stores information about the k pages in the cache. In addition it can also store information about pages not in cache, denoted bookmarks. We answer the open question of Bein et al. '07 whether strongly competitive randomized paging algorithms using only o(k) bookmarks exist or not. To do so we modify the Partition algorithm of McGeoch and Sleator '85 which has an unbounded bookmark complexity, and obtain Partition2 which uses O(k/log k) bookmarks.
In the second part we extract ideas from theoretical analysis of randomized paging algorithms in order to design deterministic algorithms that perform well in practice. We refine competitive analysis by introducing the attack rate
parameter r, which ranges between 1 and k. We show that r is a tight bound on the competitive ratio of deterministic algorithms.
We give empirical evidence that r is usually much smaller than k and thus r-competitive algorithms have a reasonable performance on real-world traces. By introducing the r-competitive priority-based algorithm class OnOPT we obtain a collection of promising algorithms to beat the LRU-standard. We single out the new algorithm RDM and show that it outperforms LRU and some of its variants on a wide range of real-world traces.
Since RDM is more complex than LRU one may think at first sight that the gain in terms of lowering the number of cache misses is ruined by high runtime for processing pages. We engineer a fast implementation of RDM, and compare it
to LRU and the very fast FIFO algorithm in an overall evaluation scheme, where we measure the runtime of the algorithms and add penalties for each cache miss.
Experimental results show that for realistic penalties RDM still outperforms these two algorithms even if we grant the competitors an idealistic runtime of 0.
Tree-width for first order formulae
- We introduce tree-width for first order formulae φ, fotw(φ). We show that computing fotw is fixed-parameter tractable with parameter fotw. Moreover, we show that on classes of formulae of bounded fotw, model checking is fixed parameter tractable, with parameter the length of the formula. This is done by translating a formula φ with fotw(φ)<k into a formula of the k-variable fragment Lk of first order logic. For fixed k, the question whether a given first order formula is equivalent to an Lk formula is undecidable. In contrast, the classes of first order formulae with bounded fotw are fragments of first order logic for which the equivalence is decidable. Our notion of tree-width generalises tree-width of conjunctive queries to arbitrary formulae of first order logic by taking into account the quantifier interaction in a formula. Moreover, it is more powerful than the notion of elimination-width of quantified constraint formulae, defined by Chen and Dalmau (CSL 2005): for quantified constraint formulae, both bounded elimination-width and bounded fotw allow for model checking in polynomial time. We prove that fotw of a quantified constraint formula φ is bounded by the elimination-width of φ, and we exhibit a class of quantified constraint formulae with bounded fotw, that has unbounded elimination-width. A similar comparison holds for strict tree-width of non-recursive stratified datalog as defined by Flum, Frick, and Grohe (JACM 49, 2002). Finally, we show that fotw has a characterization in terms of a cops and robbers game without monotonicity cost.
Long-term potentiation through calcium-mediated N-Cadherin interaction is tightly controlled by the three-dimensional architecture of the synapse
- Poster presentation: Twenty Second Annual Computational Neuroscience Meeting: CNS*2013. Paris, France. 13-18 July 2013.
The synaptic cleft is an extracellular domain that is capable of relaying a presynaptically received electrical signal by diffusive neurotransmitters to the postsynaptic membrane. The cleft is trans-synaptically bridged by ring-like shaped clusters of pre- and postsynaptically localized calcium-dependent adhesion proteins of the N-Cadherin type and is possibly the smallest intercircuit in nervous systems . The strength of association between the pre- and postsynaptic membranes can account for synaptic plasticity such as long-term potentiation . Through neuronal activity the intra- and extracellular calcium levels are modulated through calcium exchangers embedded in the pre- and postsynaptic membrane. Variations of the concentration of cleft calcium induces changes in the N-Cadherin-zipper, that in synaptic resting states is rigid and tightly connects the pre- and postsynaptic domain. During synaptic activity calcium concentrations are hypothesized to drop below critical thresholds which leads to loosening of the N-Cadherin connections and subsequently "unzips" the Cadherin-mediated connection. These processes may result in changes in synaptic strength . In order to investigate the calcium-mediated N-Cadherin dynamics at the synaptic cleft, we developed a three-dimensional model including the cleft morphology and all prominent calcium exchangers and corresponding density distributions [3-6]. The necessity for a fully three-dimensional model becomes apparent, when investigating the effects of the spatial architecture of the synapse , . Our data show, that the localization of calcium channels with respect to the N-Cadherin ring has substantial effects on the time-scales on which the Cadherin-zipper switches between states, ranging from seconds to minutes. This will have significant effects on synaptic signaling. Furthermore we see, that high-frequency action potential firing can only be relayed to the Calcium/N-Cadherin-system at a synapse under precise spatial synaptic reorganization.
Synaptic boutons sizes are tuned to best fit their physiological performances
Markus M. Knodel
- Poster presentation: Twenty Second Annual Computational Neuroscience Meeting: CNS*2013. Paris, France. 13-18 July 2013.
To truly appreciate the myriad of events which relate synaptic function and vesicle dynamics, simulations should be done in a spatially realistic environment. This holds true in particular in order to explain as well the rather astonishing motor patterns which we observed within in vivo recordings which underlie peristaltic contractionsas well as the shape of the EPSPs at different forms of long-term stimulation, presented both here, at a well characterized synapse, the neuromuscular junction (NMJ) of the Drosophila larva (c.f. Figure 1). To this end, we have employed a reductionist approach and generated three dimensional models of single presynaptic boutons at the Drosophila larval NMJ. Vesicle dynamics are described by diffusion-like partial differential equations which are solved numerically on unstructured grids using the uG platform. In our model we varied parameters such as bouton-size, vesicle output probability (Po), stimulation frequency and number of synapses, to observe how altering these parameters effected bouton function. Hence we demonstrate that the morphologic and physiologic specialization maybe a convergent evolutionary adaptation to regulate the trade off between sustained, low output, and short term, high output, synaptic signals. There seems to be a biologically meaningful explanation for the co-existence of the two different bouton types as previously observed at the NMJ (characterized especially by the relation between size and Po), the assigning of two different tasks with respect to short- and long-time behaviour could allow for an optimized interplay of different synapse types. We can present astonishing similar results of experimental and simulation data which could be gained in particular without any data fitting, however based only on biophysical values which could be taken from different experimental results. As a side product, we demonstrate how advanced methods from numerical mathematics could help in future to resolve also other difficult experimental neurobiological issues.
On functional module detection in metabolic networks
- Functional modules of metabolic networks are essential for understanding the metabolism of an organism as a whole. With the vast amount of experimental data and the construction of complex and large-scale, often genome-wide, models, the computer-aided identification of functional modules becomes more and more important. Since steady states play a key role in biology, many methods have been developed in that context, for example, elementary flux modes, extreme pathways, transition invariants and place invariants. Metabolic networks can be studied also from the point of view of graph theory, and algorithms for graph decomposition have been applied for the identification of functional modules. A prominent and currently intensively discussed field of methods in graph theory addresses the Q-modularity. In this paper, we recall known concepts of module detection based on the steady-state assumption, focusing on transition-invariants (elementary modes) and their computation as minimal solutions of systems of Diophantine equations. We present the Fourier-Motzkin algorithm in detail. Afterwards, we introduce the Q-modularity as an example for a useful non-steady-state method and its application to metabolic networks. To illustrate and discuss the concepts of invariants and Q-modularity, we apply a part of the central carbon metabolism in potato tubers (Solanum tuberosum) as running example. The intention of the paper is to give a compact presentation of known steady-state concepts from a graph-theoretical viewpoint in the context of network decomposition and reduction and to introduce the application of Q-modularity to metabolic Petri net models.
QuateXelero: an accelerated exact network motif detection algorithm
- Finding motifs in biological, social, technological, and other types of networks has become a widespread method to gain more knowledge about these networks’ structure and function. However, this task is very computationally demanding, because it is highly associated with the graph isomorphism which is an NP problem (not known to belong to P or NP-complete subsets yet). Accordingly, this research is endeavoring to decrease the need to call NAUTY isomorphism detection method, which is the most time-consuming step in many existing algorithms. The work provides an extremely fast motif detection algorithm called QuateXelero, which has a Quaternary Tree data structure in the heart. The proposed algorithm is based on the well-known ESU (FANMOD) motif detection algorithm. The results of experiments on some standard model networks approve the overal superiority of the proposed algorithm, namely QuateXelero, compared with two of the fastest existing algorithms, G-Tries and Kavosh. QuateXelero is especially fastest in constructing the central data structure of the algorithm from scratch based on the input network.
Integer point sets minimizing average pairwise L1 distance: What is the optimal shape of a town?
Erik D. Demaine
Sándor P. Fekete
- An n-town, n[is an element of]N , is a group of n buildings, each occupying a distinct position on a 2-dimensional integer grid. If we measure the distance between two buildings along the axis-parallel street grid, then an n-town has optimal shape if the sum of all pairwise Manhattan distances is minimized. This problem has been studied for cities, i.e., the limiting case of very large n. For cities, it is known that the optimal shape can be described by a differential equation, for which no closed-form solution is known. We show that optimal n-towns can be computed in O(n[superscript 7.5]) time. This is also practically useful, as it allows us to compute optimal solutions up to n=80.
Improved space bounds for strongly competitive randomized paging algorithms
- Paging is one of the prominent problems in the field of on-line algorithms. While in the deterministic setting there exist simple and efficient strongly competitive algorithms, in the randomized setting a tradeoff between competitiveness and memory is still not settled. Bein et al.  conjectured that there exist strongly competitive randomized paging algorithms, using o(k) bookmarks, i.e. pages not in cache that the algorithm keeps track of. Also in  the first algorithm using O(k) bookmarks (2k more precisely), Equitable2, was introduced, proving in the affirmative a conjecture in .
We prove tighter bounds for Equitable2, showing that it requires less than k bookmarks, more precisely ≈ 0.62k. We then give a lower bound for Equitable2 showing that it cannot both be strongly competitive and use o(k) bookmarks. Nonetheless, we show that it can trade competitiveness for space. More precisely, if its competitive ratio is allowed to be (Hk + t), then it requires k/(1 + t) bookmarks.
Our main result proves the conjecture that there exist strongly competitive paging algorithms using o(k) bookmarks. We propose an algorithm, denoted Partition2, which is a variant of the Partition algorithm byMcGeoch and Sleator . While classical Partition is unbounded in its space requirements, Partition2 uses θ(k/ log k) bookmarks. Furthermore, we show that this result is asymptotically tight when the forgiveness steps are deterministic.