[metrics-team] M.Sc projects at Edinburgh
tl
tl at rat.io
Sat Jan 9 17:08:53 UTC 2016
> On 09.01.2016, at 14:20, William Waites <wwaites at tardis.ed.ac.uk> wrote:
>
> On Fri, 8 Jan 2016 12:14:46 +0100, Karsten Loesing <karsten at torproject.org> said:
>> 3. Exposed bad relays
>> 4. Analytics Project
>> 5. Confidence intervals for user number estimates
>
> These three sound like they would fit well in the "data science" area
> where the school has a programme. #4 is probably too wide as it is
> but there could easily be sub-projects.
Yes, indeed. The first step in the analytics project was setting up the Big Data infrastructure and providing the raw Tor network data in a format readily usable by standard data analytics tools. That work is practically done, and now it has to be put to good use:
- We need aggregation scripts for common tasks like counting relays or summing advertised bandwidths, as well as more involved aggregations like the number of users, the stability of relays over certain periods, etc.
- The analytics software that we chose supports a number of different languages (R, Java, Scala, Python, SQL). It would be cool to have a collection of sample aggregation scripts in each of these languages.
- Another, more ambitious task would be to analyze possible aggregation needs, identify common components and develop scripts and preaggregated datasets for those, effectively developing a set of aggregation primitives that future users could build on when they need to analyze specific aspects of the network.
- Currently most of the aggregation is done with shell scripts and SQL (PostgreSQL). Porting these to the Big Data infrastructure would allow us to get a solid understanding of differences in performance and programming effort/code complexity.
- Our analytics software also provides a machine learning library and a graph computation extension. It would be very interesting to see if hitherto hidden patterns in the data can be extracted with their help.
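To make the first item a bit more concrete, here is a minimal sketch in plain Python of what a simple aggregation script could compute: the number of distinct relays and the total advertised bandwidth per day. The record layout (the fields `date`, `fingerprint` and `advertised_bandwidth`) and the sample values are assumptions made up for illustration, not the actual format of the converted data, and a real script would run on the analytics software rather than on an in-memory list.

```python
from collections import defaultdict


def aggregate_relays(records):
    """Return {date: (relay_count, total_advertised_bandwidth)}.

    Each record is a dict with the hypothetical keys 'date',
    'fingerprint' and 'advertised_bandwidth'. A relay that shows up
    several times on the same day is counted once, using the bandwidth
    of its first record for that day.
    """
    relays = defaultdict(set)      # date -> fingerprints seen that day
    bandwidth = defaultdict(int)   # date -> summed advertised bandwidth
    for rec in records:
        day = rec["date"]
        fingerprint = rec["fingerprint"]
        if fingerprint not in relays[day]:
            relays[day].add(fingerprint)
            bandwidth[day] += rec["advertised_bandwidth"]
    return {day: (len(relays[day]), bandwidth[day]) for day in sorted(relays)}


# Made-up sample records, only to show the shape of the input.
sample = [
    {"date": "2016-01-08", "fingerprint": "A", "advertised_bandwidth": 1000},
    {"date": "2016-01-08", "fingerprint": "B", "advertised_bandwidth": 2500},
    {"date": "2016-01-08", "fingerprint": "A", "advertised_bandwidth": 1000},
    {"date": "2016-01-09", "fingerprint": "A", "advertised_bandwidth": 1200},
]

if __name__ == "__main__":
    for day, (count, total) in aggregate_relays(sample).items():
        print(day, count, total)
```

The same grouping could be written as a GROUP BY in SQL or as a grouped aggregation in R or Scala, which is the kind of side-by-side collection the second item asks for.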
We don't have a lot of experience with these tools ourselves; we're just starting to work with them. So I'm afraid we probably couldn't give your students much practical guidance. Still, input and help from your students with this project would be greatly appreciated!
Cheers,
thms