[metrics-team] Pad discussion about notifications and package names today at 14:30 UTC (was: Brainstorming better notifications for operational issues on Wednesday, November 1, 14:30 UTC)
Karsten Loesing
karsten at torproject.org
Wed Nov 8 10:23:15 UTC 2017
Quick reminder: we scheduled another pad discussion for today at 14:30
UTC at
https://storm.torproject.org/shared/Ou-1QRctynWbF4yedi-MfDsjImFMFSIEP20fbVGCPRa
On 2017-11-02 11:22, Karsten Loesing wrote:
> On 2017-11-01 17:48, Iain R. Learmonth wrote:
>> Hi,
>
> Hi!
>
>> On 01/11/17 15:47, Karsten Loesing wrote:
>>> The following notes are the result of a 1.25-hour brainstorming session.
>>> They should not be seen as final decisions on anything, but rather as
>>> starting points for further discussions.
>>
>> Sorry I missed this. I'm still getting the hang of this daylight savings
>> time thing.
>
> No worries. And thanks for adding your thoughts below. I'll respond to
> some of them inline, but we should probably take this thread to another
> pad session, with all those open questions.
>
>>> How could we monitor our services?
>>> 1. Log warnings or errors whenever we realize we're having an issue, and
>>> somehow send those errors/warnings via email.
>>
>> metrics-bot can listen on an Onion service and then relay notifications
>> to IRC. I'm planning to move the "production" metrics-bot to a DO
>> droplet soon so it would also be possible to have the notifications sent
>> over the Internet (ACL restricted).
>
> (I'm not entirely sure what you have in mind here.)
>
>>> 2. Periodically request public resources via web interfaces and perform
>>> basic smoke tests, without adding specific information just for the sake
>>> of better monitoring.
>>
>> metrics-bot already does this for generating microblog status updates
>> (from Onionoo) and will start doing this for metrics-web CSV files in
>> the future. It's only doing it for things that it consumes though,
>> monitoring is a side-effect.
>
> Sounds good.
>
>>> 3. Locally run checks on the hosts, including whether a given process is
>>> still running.
>>
>> We can use Nagios Remote Plugin Executor (NRPE) for this if the sysadmin
>> team is happy with that.
>
> Fine question, we could find out.
>
>>> 1. CollecTor
>>> - notifications about errors or warnings in logs
>>
>> Is there a regular expression we can match on?
>
> We could probably create one, yes.
>
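A minimal sketch of what such a regular expression could look like. The log line layout (timestamp, then level, then message) is an assumption, not CollecTor's confirmed format, and `find_problems` is a hypothetical helper name:

```python
import re

# Assumed layout: "YYYY-MM-DD HH:MM:SS[,mmm] LEVEL message".
# The real log format needs to be checked before relying on this.
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}(?:[.,]\d{3})?)\s+"
    r"(?P<level>WARN|ERROR)(?:ING)?\b\s*"
    r"(?P<message>.*)$"
)


def find_problems(lines):
    """Yield (level, message) for every warning or error line."""
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match:
            yield match.group("level"), match.group("message")
```

Lines on other levels (info, notice, debug) are simply ignored, so the same function could run over a whole log file.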
>>> - learn when the disk almost runs full (currently provided by Tor's
>>> Nagios and by a warning in the logs)
>>
>> Is this using NRPE already?
>
> That's again a question for the admins that we should ask them when we
> have a better idea what we need.
>
>>> - learn when a collector process has died, either by checking locally
>>> whether the process still exists, by looking at logs for regular
>>> info/notice level entries, or by fetching the index.json and looking
>>> whether the "index_created" timestamp is older than 30 minutes/3 hours
>>
>> We can write a Nagios check for this. It would look very similar to the
>> existing check for Onionoo (fetching and parsing JSON).
>
> True. I'm a big fan of that idea, because it doesn't require us to make
> any changes on existing instances.
>
> I think iwakeh is more in favor of doing something with logs, which
> would allow us to monitor things more closely but which would require
> access to the hosts.
>
>>> - learn when a data source has become stale by looking at
>>> "last_modified" timestamps contained in index.json or by looking at the logs
>>
>> As above.
>>
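A rough sketch of the two index.json checks discussed above, working on an already fetched and parsed index. Several things here are assumptions: the timestamp format, the nesting of "directories" and "files", and reading "30 minutes/3 hours" as warning and critical thresholds. The function names are hypothetical:

```python
from datetime import datetime, timedelta, timezone

# Assumed timestamp format in index.json; verify against a real index.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M"

# Nagios-style exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def parse_timestamp(text):
    """Parse an index.json timestamp as a UTC datetime."""
    parsed = datetime.strptime(text, TIMESTAMP_FORMAT)
    return parsed.replace(tzinfo=timezone.utc)


def check_index_age(index, now,
                    warn_after=timedelta(minutes=30),
                    crit_after=timedelta(hours=3)):
    """Return (code, message) based on the age of "index_created"."""
    age = now - parse_timestamp(index["index_created"])
    if age > crit_after:
        return CRITICAL, "index is %s old" % age
    if age > warn_after:
        return WARNING, "index is %s old" % age
    return OK, "index is %s old" % age


def stale_files(node, now, max_age, prefix=""):
    """Yield paths of files whose "last_modified" is older than max_age."""
    for entry in node.get("files", []):
        if now - parse_timestamp(entry["last_modified"]) > max_age:
            yield prefix + entry["path"]
    for directory in node.get("directories", []):
        sub_prefix = prefix + directory["path"] + "/"
        yield from stale_files(directory, now, max_age, sub_prefix)
```

Both functions take `now` as a parameter, so they can be tested against a saved index.json without touching the network or the hosts.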
>>> 2. OnionPerf
>>> - Does one or more of the OnionPerf hosts not report recent measurements?
>>
>> As above, but parsing the HTML (my preference would be to do this with
>> bs4, it's in Debian stable).
>>
>>> 3. Onionoo
>>> - [deployed] Onionoo has a Nagios warning that fetches a minimal
>>> response and checks timestamps (which is the only way we notice
>>> problems with the bridge authority), but cf. #23984
>>> - nusenu suggests via email (mostly as an onionoo user):
>>> - reachability (TCP)
>>> - service working (HTTP 200 vs. 404, 500,...) (via active probes and
>>> via log monitoring. Increase in 500 status codes?)
>>> - response times (significantly higher than usual?)
>>> - data updated? (i.e. onionoo data older than 4-5 hours should
>>> trigger an alert)
>>> - minimal sanity checks (i.e. /details should contain more than 5k
>>> relay records) [KL: note that we wouldn't have to fetch 5k records for
>>> this, we could just parse relays_skipped.]
>>
>> All of this could be implemented in the Nagios check.
>
> Agreed.
>
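To make nusenu's list a bit more concrete, here is a sketch of how those checks might be combined once a response has been fetched. The thresholds for data age and relay count come from the list above; the response time limit is a made-up placeholder, and counting relays as listed entries plus "relays_skipped" follows the note in the thread and should be verified against the Onionoo protocol. `check_onionoo` is a hypothetical name:

```python
from datetime import datetime, timedelta, timezone


def check_onionoo(status, elapsed_seconds, doc, now,
                  max_age=timedelta(hours=5),
                  min_relays=5000,
                  max_seconds=10.0):
    """Return a list of problem descriptions; empty means all fine.

    status and elapsed_seconds describe the HTTP request, doc is the
    parsed JSON response. max_seconds is a placeholder threshold.
    """
    if status != 200:
        # Without a valid response the remaining checks make no sense.
        return ["HTTP status %d" % status]

    problems = []
    if elapsed_seconds > max_seconds:
        problems.append("slow response: %.1f s" % elapsed_seconds)

    published = datetime.strptime(
        doc["relays_published"], "%Y-%m-%d %H:%M:%S"
    ).replace(tzinfo=timezone.utc)
    if now - published > max_age:
        problems.append("data is %s old" % (now - published))

    # Assumption: relays not listed in the response are counted in
    # "relays_skipped", so we do not need to fetch all records.
    relay_count = len(doc.get("relays", [])) + doc.get("relays_skipped", 0)
    if relay_count < min_relays:
        problems.append("only %d relays" % relay_count)

    return problems
```

A Nagios plugin could then map an empty list to OK and anything else to WARNING or CRITICAL.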
>>> 4. Statistics (part of metrics-web)
>>> - [deployed] metrics-web sends a short log twice per day,
>>
>> Is the log secret?
>
> Fine question! Maybe! It may contain parts that we found too sensitive
> to keep in sanitized descriptors, and those are certainly secret. We
> could split up such log messages into secret ones on info level and
> non-secret ones on warn level, and only publish warn and error logs. But
> we might miss something there. Maybe we should assume that logs remain
> secret.
>
>> Is there a regex we can match on?
>
> Not really. It's log output from various tools. But I think that's
> nothing we should attempt to solve in the monitoring tool, it's
> something we need to solve by cleaning up metrics-web more first.
>
>> If we can publish the log and have it fetched by a Nagios plugin, no one
>> has to read them every time.
>>
>>> 5. ExoneraTor
>>> - [deployed] ExoneraTor sends a message when it finds an existing lock
>>> file, etc.
>>
>> Does this happen often?
>
> Only when it breaks. Every few months?
>
>>> 6. Website (Tor Metrics, plus Atlas, ExoneraTor, Compass etc. until
>>> they're migrated)
>>
>> We should come up with a list of test URLs and expected responses,
>> response times, etc.
>
> Yes, good idea.
>
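One possible shape for such a list of test URLs and expected responses, as a sketch only. The single entry in `CHECKS` is a placeholder with made-up expected text and time limit, and `fetch` and `run_checks` are hypothetical names:

```python
import time
import urllib.error
import urllib.request

# (url, expected status, expected substring, max seconds)
# Placeholder values; the real list still needs to be written.
CHECKS = [
    ("https://metrics.torproject.org/", 200, "Tor Metrics", 5.0),
]


def fetch(url, timeout=10.0):
    """Return (status, body, elapsed_seconds); status 0 means no response."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
            body = response.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as error:
        status, body = error.code, ""
    except OSError:
        # Connection failures and timeouts end up here.
        status, body = 0, ""
    return status, body, time.monotonic() - start


def run_checks(checks, fetch=fetch):
    """Return a list of (url, problem) for every failed check."""
    failures = []
    for url, want_status, want_text, max_seconds in checks:
        status, body, elapsed = fetch(url)
        if status != want_status:
            failures.append(
                (url, "status %d, expected %d" % (status, want_status)))
        elif want_text not in body:
            failures.append((url, "expected text not found"))
        elif elapsed > max_seconds:
            failures.append((url, "took %.1f s" % elapsed))
    return failures
```

Passing `fetch` as a parameter keeps the comparison logic testable without network access; calling `run_checks(CHECKS)` would perform the real requests.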
>>> 7. Bot
>>
>> This could be complicated, as there are many functions in the bot. For
>> now I don't think that this needs to be considered, and we can revisit
>> if/when it moves to a Tor machine.
>
> Okay. I guess I thought of something very simple, like seeing if it's
> still alive, just like the website checks above. But, happy to keep this
> out for now.
>
>>> 8. Notification service
>>> - Learn when the notification service itself goes down!
>>
>> What would we test for and how? This would depend on the tool.
>
> Test that the notification service is still alive. Bad news if it dies
> and we don't get any notifications about all the other stuff.
>
>> I'd rather not start thinking about the exact tool just yet, but that
>> was a good list of options that we can think about in the future.
>
> Sounds good!
>
> Let's schedule a follow-up meeting to move this forward. I'll bring this
> up today at the team meeting (attention: the UTC time stayed the same,
> your clock may have changed! :))
>
>> Thanks,
>> Iain.
>
> All the best,
> Karsten
>