[tor-dev] Metrics Plans
Kostas Jakeliunas
kostas at jakeliunas.com
Mon Jun 10 21:56:13 UTC 2013
Ah, forgot to add my footnote to the dirspec - we all know the link, but in
any case:
[1]: https://gitweb.torproject.org/torspec.git/blob/HEAD:/dir-spec.txt
This was in the context of discussing which fields from section 2.1 (the
Router descriptor format) to include.
On Tue, Jun 11, 2013 at 12:34 AM, Kostas Jakeliunas
<kostas at jakeliunas.com> wrote:
>> > Here, I think it is realistic to try and use and import all the fields
>> > available from metrics-db-*.
>> > My PoC is overly simplistic in this regard: only relay descriptors, and
>> > only a limited subset of data fields is used in the schema, for the import.
>>
>> I'm not entirely sure what fields that would include. Two options come
>> to mind...
>>
>> * Include just the fields that we need. This would require us to
>> update the schema and perform another backfill whenever we need
>> something new. I don't consider this 'frequent backfill' requirement
>> to be a bad thing though - this would force us to make it extremely
>> easy to spin up a new instance which is a very nice attribute to have.
>>
>> * Make the backend a more-or-less complete data store of descriptor
>> data. This would mean schema updates whenever there's a dir-spec
>> addition [1]. An advantage of this is that the ORM could provide us
>> with stem Descriptor instances [2]. For high traffic applications
>> though we'd probably still want to query the backend directly since we
>> usually won't care about most descriptor attributes.
>>
>
> In truth, I'm not sure here, either. I agree that it basically boils down
> to one of the two aforementioned options, and I'm okay with either. I'd
> like, however, to see how well the db import scales if we were to import
> all relay descriptor fields. There aren't a lot of them (dirspec [1]) if
> we leave out extra-info and only deal with the Router descriptor format
> (2.1). So I think I should try working with those fields and see whether
> the import goes well and quickly enough. I plan to write simple Python
> timeit / timing-report wrappers that can be attached to and detached from
> functions easily; that would be a simple and clean way to measure things.
>
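A minimal sketch of what such attachable timing wrappers could look like,
assuming a plain decorator is enough. The names `timed`, `report`, `TIMINGS`
and `import_descriptor` are hypothetical and are not taken from the PoC.

```python
import functools
import time

# Hypothetical registry of timings: function name -> list of elapsed seconds.
TIMINGS = {}


def timed(func):
    """Record the wall-clock duration of every call to func in TIMINGS."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            TIMINGS.setdefault(func.__name__, []).append(elapsed)
    return wrapper


def report():
    """Return one summary line per timed function, sorted by name."""
    return ["%s: %d call(s), %.6f s total" % (name, len(samples), sum(samples))
            for name, samples in sorted(TIMINGS.items())]


@timed
def import_descriptor(raw_text):
    # Stand-in for the real import step (an assumption); it only counts lines.
    return len(raw_text.splitlines())
```

Attaching is a matter of adding `@timed`. Since `functools.wraps` sets
`__wrapped__`, the untimed function stays reachable as
`import_descriptor.__wrapped__`, which is one way to "detach" the timing
without editing the function itself.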
> > [...] An advantage of [more-or-less complete data store of descriptor
> > data] is that the ORM could provide us
> > with stem Descriptor instances [2]. For high traffic applications
> > though we'd probably still want to query the backend directly since we
> > usually won't care about most descriptor attributes.
>
> I can try experimenting with this later on (e.g. once we have the full /
> needed importer working), but it might indeed be difficult to scale (not
> sure, of course). Do you have any specific use cases in mind? (I'm
> actually curious; it could be interesting to hear.) Footnote [2] is noted,
> I'll think about it.
>
>
>> > The idea would be to import all data as DB fields (so, indexable), but it
>> > makes sense to also import raw text lines to be able to e.g. supply the
>> > frontend application with raw data if needed, as the current tools do. But
>> > I think this could be made to be a separate table, with descriptor id as
>> > primary key, which means this can be done later on if need be, would not
>> > cause a problem. I guess there's no need to do this right now.
>>
>> I like this idea. A couple advantages that this could provide us are...
>>
>> * The importer can provide warnings when our present schema is out of
>> sync with stem's Descriptor attributes (ie. there has been a new
>> dir-spec addition).
>>
>> * After making the schema update the importer could then run over this
>> raw data table, constructing Descriptor instances from it and
>> performing updates for any missing attributes.
>>
>
> The 'schema/format mismatch report' idea sounds really good! It would
> certainly matter if we are to try for Onionoo compatibility / eventual
> replacement, but in any case it seems like a very useful thing for the
> future. I will keep this in mind for the near future / the database
> importer rewrite.
>
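To make the mismatch report concrete, here is a rough sketch that assumes an
SQLite backend and a caller-supplied list of descriptor attribute names. The
table name `descriptor`, its columns and the helper names are hypothetical;
how the attribute names would be obtained from stem is left open.

```python
import logging
import sqlite3

log = logging.getLogger("importer")


def schema_columns(conn, table):
    """Return the set of column names of table via SQLite's PRAGMA table_info."""
    # Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk).
    return {row[1] for row in conn.execute("PRAGMA table_info(%s)" % table)}


def missing_columns(conn, table, descriptor_attrs):
    """Log a warning per attribute that has no column; return them sorted."""
    missing = sorted(set(descriptor_attrs) - schema_columns(conn, table))
    for name in missing:
        log.warning("schema for %r has no column for attribute %r", table, name)
    return missing


# Hypothetical minimal schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE descriptor (digest TEXT PRIMARY KEY, nickname TEXT, uptime INTEGER)"
)
```

Calling `missing_columns(conn, "descriptor", ["nickname", "uptime", "platform"])`
would report `platform` as the one attribute the schema does not cover yet.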
> > * After making the schema update the importer could then run over this
> > raw data table, constructing Descriptor instances from it and
> > performing updates for any missing attributes.
>
> I can't say I can easily see the specifics of how all this would work, but
> if we had an always-up-to-date data model (mediated by the Stem Relay
> Descriptor class, but not necessarily), this might work. (The ORM <-> Stem
> Descriptor object mapping itself is trivial, so all is well in that regard.)
>
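One possible shape for the "re-run over the raw data table" step, sketched
with SQLite and a hand-rolled parser for the `uptime` line instead of stem
Descriptor instances. The table names, the `digest` key and the helper names
are assumptions made only for this example.

```python
import sqlite3


def parse_uptime(raw_text):
    """Return the value of the 'uptime' line as an int, or None if absent."""
    for line in raw_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "uptime":
            return int(parts[1])
    return None


def backfill_uptime(conn):
    """Fill descriptor.uptime from the raw text wherever it is still NULL."""
    rows = conn.execute(
        "SELECT d.digest, r.raw FROM descriptor d "
        "JOIN descriptor_raw r ON r.digest = d.digest "
        "WHERE d.uptime IS NULL"
    ).fetchall()
    for digest, raw in rows:
        conn.execute("UPDATE descriptor SET uptime = ? WHERE digest = ?",
                     (parse_uptime(raw), digest))
    return len(rows)


# Hypothetical schema: parsed fields in one table, raw text in another,
# both keyed by the descriptor id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE descriptor (digest TEXT PRIMARY KEY, nickname TEXT)")
conn.execute("CREATE TABLE descriptor_raw (digest TEXT PRIMARY KEY, raw TEXT)")
conn.execute("INSERT INTO descriptor VALUES (?, ?)", ("abc", "example"))
conn.execute("INSERT INTO descriptor_raw VALUES (?, ?)",
             ("abc", "router example 192.0.2.1 9001 0 0\nuptime 3600"))

# The schema update, followed by the backfill from the raw table.
conn.execute("ALTER TABLE descriptor ADD COLUMN uptime INTEGER")
updated = backfill_uptime(conn)
```

Because the raw text stays in its own table, the same pattern could be
repeated for each newly added column without re-reading the archives.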
> On Wed, May 29, 2013 at 5:49 PM, Damian Johnson <atagar at torproject.org> wrote:
>
>> > Here, I think it is realistic to try and use and import all the fields
>> > available from metrics-db-*.
>> > My PoC is overly simplistic in this regard: only relay descriptors, and
>> > only a limited subset of data fields is used in the schema, for the import.
>>
>> I'm not entirely sure what fields that would include. Two options come
>> to mind...
>>
>> * Include just the fields that we need. This would require us to
>> update the schema and perform another backfill whenever we need
>> something new. I don't consider this 'frequent backfill' requirement
>> to be a bad thing though - this would force us to make it extremely
>> easy to spin up a new instance which is a very nice attribute to have.
>>
>> * Make the backend a more-or-less complete data store of descriptor
>> data. This would mean schema updates whenever there's a dir-spec
>> addition [1]. An advantage of this is that the ORM could provide us
>> with stem Descriptor instances [2]. For high traffic applications
>> though we'd probably still want to query the backend directly since we
>> usually won't care about most descriptor attributes.
>>
>> > The idea would be to import all data as DB fields (so, indexable), but it
>> > makes sense to also import raw text lines to be able to e.g. supply the
>> > frontend application with raw data if needed, as the current tools do. But
>> > I think this could be made to be a separate table, with descriptor id as
>> > primary key, which means this can be done later on if need be, would not
>> > cause a problem. I guess there's no need to do this right now.
>>
>> I like this idea. A couple advantages that this could provide us are...
>>
>> * The importer can provide warnings when our present schema is out of
>> sync with stem's Descriptor attributes (ie. there has been a new
>> dir-spec addition).
>>
>> * After making the schema update the importer could then run over this
>> raw data table, constructing Descriptor instances from it and
>> performing updates for any missing attributes.
>>
>> Cheers! -Damian
>>
>> [1] https://gitweb.torproject.org/torspec.git/blob/HEAD:/dir-spec.txt
>> [2] This might be a no-go. Stem Descriptor instances are constructed
>> from the raw descriptor content, and need it for str(), get_bytes(),
>> and signature validation. If we don't care about those we can subclass
>> Descriptor and override those methods.
>>
>
>