

(Big?) Data Question: Searching in a large dataset - georgismilyanov

I am working on a model of social interactions in mice. I have mice and boxes and a simulation that outputs which mouse <i>stays</i> in which box during which time period. The problem is how to obtain, in the end and from this, the <i>meetings</i> of two mice which were in the same box in an overlapping period.<p>Right now I have a MySQL database where the simulation directly inserts each stay result. And then another tool,written in Scala, just retrieves all stay result, in portions of a few hundred, loops through them and for each asks the database which stays were overlapping with it, and inserts each pair into the database, like this:<p>`box`,`id1`, `res_id1`, `id2`, `res_id2`, `from`, `to`, `dt`, `typ`<p>This means that mice id1 and id2 were in the box &quot;box&quot; in the interval between &quot;from&quot; and &quot;to&quot; with duration &quot;dt&quot; and the meetings was of type &quot;typ&quot;. There could be four types of meetings depending on when each mouse was in the box (e.g. when the one entered and exited relative to the other). &quot;res_id1&quot; and &quot;res_id2&quot; tell which stay results were used to generate the meeting result.<p>Obviously, this is very inefficient. What would be a better way of doing it? I am not constrained to using a RDMS, but I though this would be the easiest as I am reading and further analysing the data in R afterwards. Would it make sense to output the stays in a text file and then use Hadoop to generate the meetings somehow? Or anything else?<p>For the duration of an approximately <i>one quarter of one trial</i> of the simulation I generate around 1.5 million stay results.
======
Someone
I expect your simulation simulates the movements of all mice simultaneously in
time order.

If so, it might be easier to directly create the overlap results from the
simulation. For example, when a mouse enters a box, check and remember what
other mice are there. When that mouse leaves, check again and create the
record for the stay.

If that isn't possible or feasible, check whether your database supports range
types and queries. It might be worthwhile to migrate to PostgreSQL just for
this (see [https://wiki.postgresql.org/images/7/73/Range-types-
pgopen-2...](https://wiki.postgresql.org/images/7/73/Range-types-
pgopen-2012.pdf))

~~~
georgismilyanov
It is not simultaneous. I use the Gillespie technique for stochastic
simulation. I select a mouse at random to change boxes, depending on its
transition rate compared to the transition rate of the system as a whole.

~~~
Someone
Are the mice interchangeable? If so, can you simulate one, get a distribution
of box occupancy from it, and compute interactions from that?

~~~
georgismilyanov
No. That's the point. It is a simulation of a complex system. The interactions
between the mice are everything.

------
aba_sababa
How about having a table where each row corresponds to a duration of time?

time | total | mice_ids (a json string. or use mongo)

And then, run through all your data and assign each mouse to a certain
duration (ie, minute one, minute two, etc), updating the total counts every
time as well as the mice_ids. Then, you could query by minute and see which
minutes had meetings in them (ie, select mice where total > 2)

~~~
georgismilyanov
Thank you! But I am thinking, could I be able to obtain easily the duration of
_each_ meeting using this setup? This is what I would ultimately need (to
construct the empirical cdf, social network, etc.)

~~~
aba_sababa
Hmm. How are you defining a meeting? Is a meeting a pair of mice? Or is it a
pair of mice with a time component? So a pair of mice + the duration of time
they were together?

~~~
georgismilyanov
Exactly.

There are four types of meeting: mouse 1 entered -> mouse 2 entered -> mouse 2
left -> mouse 1 left; mouse 1 entered -> mouse 2 entered -> mouse 1 left ->
mouse 2 left; mouse 2 entered -> mouse 1 entered -> mouse 1 left -> mouse 2
left; mouse 2 entered -> mouse 1 entered -> mouse 2 left -> mouse 1 left.

Then the meeting record contains the times the two were together and the
duration of this.

The dumb approach is very slow. Would it be a good idea to read all stay
results in Scala (or whatever) and try to do the searching in memory,
completely eliminating th db? But then the comparisons would be more
expensive...

------
mathattack
Try a columnar database solution for time series.

