[tor-dev] [GSOC 16] Ahmia status update #1
Ismael R
zma at riseup.net
Fri Jun 3 18:41:34 UTC 2016
Hi everyone,
I'm working on ahmia.fi, the hidden service search engine and you're
reading status update #1.
During the last two weeks I've been working on several things:
1/ Settle on a new structure for the ahmia source code.
The official repository [1] contains all the code related to ahmia. Some
of this code is deprecated (solr is not used anymore) and the
documentation was out of date, so it needed a bit of cleanup anyway.
A structure with two repositories was chosen:
- ahmia-site [2] is going to contain the django website, configuration
to use it in production (apache, nginx, uwsgi) and documentation on how
to get the project running.
- ahmia-crawler [3] is going to contain scrapy bots, configuration +
documentation (elasticsearch, polipo)
I tried to keep all past commits when creating these repositories.
2/ Update documentation
See [2] and [3].
3/ Start to refactor the django project
The django project is going to be composed of two apps:
- search is going to be the search engine frontend + future API endpoints
- trends is going to be the statistics visualization frontend + future
API endpoints
Some logic is also going to move from the website source code to the
indexer part of the search engine (e.g. removal of fake/banned domains).
You can see this work on the ahmia-site repository [2].
Note: The trends app is not yet done so it isn't visible online.
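
To illustrate the idea of filtering fake/banned domains on the indexer side
rather than in the website, here is a minimal sketch. It is not the actual
ahmia code: the function names, the item layout (a dict with a "url" key)
and the blocklist entry are all assumptions made for the example.

```python
from urllib.parse import urlparse

# Hypothetical blocklist; real entries would come from ahmia's own data.
BANNED_DOMAINS = {"bannedexampleaaaa.onion"}


def onion_domain(url):
    """Return the base .onion domain of a URL, or None for non-onion URLs."""
    host = (urlparse(url).hostname or "").lower()
    if not host.endswith(".onion"):
        return None
    # Keep only the last two labels so subdomains map to their base domain.
    return ".".join(host.split(".")[-2:])


def filter_indexable(items, banned=BANNED_DOMAINS):
    """Yield only the crawled items that should reach the index."""
    for item in items:
        domain = onion_domain(item["url"])
        if domain is not None and domain not in banned:
            yield item
```

With a filter like this applied before documents are indexed, the website
would never see banned domains and would not need its own filtering logic.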
4/ Implement continuous integration with Travis CI
Tests are going to be run automatically on Travis CI.
I am also considering displaying test code coverage with coveralls.io, but
I worry that people would focus on improving the coverage percentage at
all costs, which is not a healthy goal.
This work is going to be pushed during the weekend.
5/ Start to write a proposal with details on how to improve search
I have yet to write a much more readable document, but here are a few
ideas:
- Regroup all data related to domains, stats and content into elasticsearch
so we can use it for search or insights
- What about a pagerank-like algorithm to estimate a webpage's popularity
instead of tor2web popularity?
- Improve search with human language thanks to elasticsearch [4]
- Use static boosting with popularity (or pagerank) field [5]
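
As a rough sketch of the last three ideas (not a settled design), the code
below computes a pagerank-like score over a small link graph and builds an
elasticsearch function_score query that boosts text matches by that score.
The field names "content" and "pagerank", the damping factor and the
iteration count are assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


def boosted_query(text, field="pagerank"):
    """Build a query body that multiplies text relevance by a stored score."""
    return {
        "query": {
            "function_score": {
                # "content" is a hypothetical field holding the page text.
                "query": {"match": {"content": text}},
                "field_value_factor": {"field": field, "modifier": "log1p"},
                "boost_mode": "multiply",
            }
        }
    }
```

The scores would be stored on each document at index time, which is what
makes the boost "static". The modifier and boost_mode would need tuning
against real data; with "multiply", a page whose score is zero is pushed to
the bottom regardless of how well its text matches.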
We have a meeting planned on Tuesday with all of ahmia's contributors. I
hope to have a clean proposal by then to discuss with them.
During the next two weeks, I plan to continue working on the same
things. I want to finish 1/ to 4/ as quickly as possible to start
working on search quality.
See you in two weeks :)
Ismael
[1] https://github.com/ahmia/search
[2] https://github.com/iriahi/ahmia-site
[3] https://github.com/iriahi/ahmia-crawler
[4] https://www.elastic.co/guide/en/elasticsearch/guide/current/languages.html
[5] https://marcobonzanini.com/2015/06/22/tuning-relevance-in-elasticsearch-with-custom-boosting/