# Managing operations in SaaS

### Deployments

All deployments are automated thanks to our [Ansible playbooks](https://www.ansible.com/).

All our playbooks are versioned, maintained and reviewed by the Toucan tech team.

{% hint style="success" %}
**Benefits**

As a best practice, we **never** connect directly to a Toucan node to run manual commands. This reduces the risk of human error and allows deployments to scale automatically.
{% endhint %}

### Log Management

All logs generated by Toucan’s applications are centralized in an [Elastic stack](https://www.elastic.co/products). Toucan’s tech team can follow activity on these apps as well as any warnings and errors thanks to [Kibana](https://www.elastic.co/kibana) dashboards.

We also run a centralized system log management platform that consolidates all our system logs (e.g. syslog, auth, nginx access/error, security tools…). Our dashboards let us detect brute force attacks, spam and malicious behavior. For each detected pattern, we receive automated alerts thanks to [Elastalert](https://github.com/Yelp/elastalert).

We retain logs for about 8 weeks in our [Elastic stack](https://www.elastic.co/products). By default, we also keep 14 weeks of web access/error logs and 52 weeks of app logs on our servers.
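
As an illustration of the retention windows above, here is a minimal Python sketch that decides whether a log entry has aged out. The dictionary keys and function names are hypothetical and chosen for this example only; they do not reflect Toucan's actual tooling.

```python
from datetime import datetime, timedelta, timezone

# Retention windows in weeks, taken from the policy described above.
# The keys are illustrative names, not real configuration keys.
RETENTION_WEEKS = {
    "elastic_stack": 8,
    "web_access_error": 14,
    "app": 52,
}


def retention_cutoff(log_kind: str, now: datetime) -> datetime:
    """Return the oldest timestamp still kept for the given kind of log."""
    return now - timedelta(weeks=RETENTION_WEEKS[log_kind])


def is_expired(log_kind: str, log_time: datetime, now: datetime) -> bool:
    """A log entry is expired once it is older than its retention cutoff."""
    return log_time < retention_cutoff(log_kind, now)


if __name__ == "__main__":
    now = datetime(2024, 6, 1, tzinfo=timezone.utc)
    ten_weeks_old = now - timedelta(weeks=10)
    print(is_expired("elastic_stack", ten_weeks_old, now))     # True: older than 8 weeks
    print(is_expired("web_access_error", ten_weeks_old, now))  # False: within 14 weeks
```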

{% hint style="success" %}
**Benefits**

As a best practice, we **never** need to directly connect to a Toucan node to follow activity and logs. This is at the core of our ability to scale our monitoring.
{% endhint %}

### Vulnerability Scans

We regularly and automatically scan our servers for:

* open ports
* misconfiguration
* lack of security updates

{% hint style="success" %}
**Benefits**

We keep our environment up to date (system, security, patches…).
{% endhint %}

### Alerts

Our monitoring services alert us when:

* a server becomes unresponsive
* a server shows unusual CPU, memory or disk activity
* a server is getting closer to its hardware limits
* an application status page shows that it’s not OK
* one of the following ports is not listening: 443/80/22
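
The port check in the last item can be sketched as a plain TCP connection attempt. This is a minimal illustration using Python's standard `socket` module, not a description of the monitoring services Toucan actually uses; the function names are hypothetical.

```python
import socket

# Ports named in the alert list above.
MONITORED_PORTS = (443, 80, 22)


def is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def closed_ports(host: str, ports=MONITORED_PORTS) -> list[int]:
    """Return the monitored ports that are not accepting connections."""
    return [port for port in ports if not is_listening(host, port)]
```

A monitoring job could call `closed_ports(host)` periodically and raise an alert whenever the returned list is not empty.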

Furthermore, we use [OSSEC](https://www.ossec.net/) to alert us to possible intrusions.

{% hint style="success" %}
**Benefits**

This monitoring runs 24/7 and every alert is checked to ensure a fast reaction from the Toucan tech team.
{% endhint %}

### Performance Monitoring

Our infrastructure and applications are continuously monitored by several external services (like [NewRelic](https://newrelic.com/), [Sentry](https://sentry.io/), [StatusCake](https://www.statuscake.com/) or [BetterUptime](https://betterstack.com/better-uptime)).

Every week these services send us detailed performance and uptime reports.

{% hint style="success" %}
**Benefits**

These regular reports help us to identify potential regressions or bottlenecks that can then be fixed.
{% endhint %}

### Watch and patch management

To discover new vulnerabilities and patch against them as quickly as possible, we follow:

| Item        | Name    | Description                                                                                                                          |
| ----------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------ |
| Database    | MongoDB | [Mongo CVE details](http://www.cvedetails.com/vulnerability-list/vendor_id-12752/product_id-25450/Mongodb-Mongodb.html)              |
| Database    | MongoDB | [Mongo Security Checklist](https://docs.mongodb.com/v5.0/administration/security-checklist/)                                         |
| Application | Python  | [Python CVE database](https://www.cvedetails.com/vulnerability-list/vendor_id-13534/product_id-28125/Docker-Docker.html)             |
| Container   | Docker  | [Docker Dev Mailing list](https://groups.google.com/forum/#!forum/docker-dev)                                                        |
| Container   | Docker  | [Docker User Mailing list](https://groups.google.com/forum/#!forum/docker-user)                                                      |
| System      | Ubuntu  | Ubuntu LTS packages                                                                                                                  |
| System      | Ubuntu  | [Ubuntu Security List](https://www.ubuntu.com/usn/)                                                                                  |
| System      | Debian  | [Debian Security List](https://www.debian.org/security/)                                                                             |
| System      | pfSense | Auto-updates                                                                                                                         |

We also follow the GitHub issues and announcements of the main projects we use.

As soon as a security patch is available, we automatically apply it to our whole infrastructure using our [Ansible](https://www.ansible.com/) scripts.

In addition, our infrastructure is fully updated every 2 months with our [Ansible playbooks](https://www.ansible.com/). Before applying updates everywhere, we apply them to a staging node to make sure they introduce no regression.

These update processes can very occasionally lead to a short downtime, which we schedule outside office hours.

If the infrastructure or the applications are impacted by a known vulnerability, we always send the client an email report to warn them and explain how we are remediating it.

### Backups

We run a daily backup process for each instance/project.

The backup is a complete snapshot, encrypted with a GPG key (dedicated to the instance/project) and exported over `rsync+ssh` to our exclusive backup nodes.

GPG keys are available only to Toucan’s admins and are stored in our password manager.

All backups are exported to a dedicated storage service.

By default, we retain 20 daily backups for each instance/project.
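
To illustrate the default retention, here is a minimal Python sketch that separates the 20 most recent daily backups from the ones that would be pruned. It is a sketch under the assumption that each backup is identified by its date; the function name is hypothetical and this is not Toucan's actual backup script.

```python
from datetime import date, timedelta

DEFAULT_RETENTION = 20  # number of daily backups kept per instance/project


def split_backups(backup_dates, keep=DEFAULT_RETENTION):
    """Split backup dates into (kept, expired), keeping the `keep` most recent."""
    newest_first = sorted(backup_dates, reverse=True)
    return newest_first[:keep], newest_first[keep:]


if __name__ == "__main__":
    # 25 consecutive daily backups: the 5 oldest fall outside the retention window.
    dates = [date(2024, 1, 1) + timedelta(days=i) for i in range(25)]
    kept, expired = split_backups(dates)
    print(len(kept), len(expired))  # 20 5
```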

{% hint style="success" %}
**Benefits**

We regularly test and challenge our backup and restoration scripts.

Restoring an instance or a project is a fully automated and fast process.
{% endhint %}

### Issues logbook

As part of our culture, we keep a logbook of every issue on the infrastructure.

Each logbook entry describes:

* what happened
* how we came to understand the issue
* what we did to solve the problem
* what the impacts were
* what we need to do to prevent it from happening again

{% hint style="success" %}
**Benefits**

The logbook is open to every Toucan employee. Knowledge about the life of the infrastructure and its issues is shared and maintained by everyone.
{% endhint %}

### Communication during issues

As soon as we detect an issue, your dedicated account manager and/or client success manager will contact you to explain the issue and its potential impacts, and to give you an estimated resolution time.

When the issue is closed, you can expect a post-mortem report, mainly extracted from our logbook (see the previous section), with details about the investigation and the resolution process.

This emergency communication is available 24/7.

Your instance has a dedicated status page at `{instance-name}.status.toucantoco.com`, where you can check the status of the services, find information about scheduled maintenance and consult incident reports. You can subscribe to this page by email to receive updates.

### Support

Our main support channel is email, via [help@toucantoco.com](mailto:help%40toucantoco.com).

Support is open between 9:00 and 18:00 (Paris time) on working days.

### On-Call duty team

We have a dedicated “on-call duty” team (levels 1, 2 and 3) at night and on weekends to watch for and fix major issues.

### Project instance and server decommission

Each time we need to decommission a project instance:

* the dedicated stack is shut down (virtual hosts, API process, workers, queue server, database)
* all data, logs and associated configuration are erased

Each time we need to decommission a server:

* the data and home partitions are fully formatted
* we force a basic rewrite of each partition (with a basic `dd` command), so that no block can be restored to its previous state
* then we release the server to Scaleway
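
The rewrite step above can be illustrated with a minimal Python sketch that overwrites every byte of a target with zeros, in the spirit of a basic `dd` run. The exact `dd` invocation is not given here, so this is only an assumption of what such a rewrite does. The sketch works on a regular file standing in for a partition, and the function name is hypothetical.

```python
import os

CHUNK_SIZE = 1024 * 1024  # 1 MiB per write


def overwrite_with_zeros(path: str, chunk_size: int = CHUNK_SIZE) -> int:
    """Overwrite every byte of an existing regular file with zeros.

    Returns the number of bytes written. The file size is left unchanged.
    """
    size = os.path.getsize(path)
    zeros = bytes(chunk_size)  # a reusable buffer of zero bytes
    written = 0
    with open(path, "r+b") as target:
        while written < size:
            n = min(chunk_size, size - written)
            target.write(zeros[:n])
            written += n
        # Push the zeros through Python and OS buffers down to the disk.
        target.flush()
        os.fsync(target.fileno())
    return written
```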

{% hint style="success" %}
**Benefits**

A decommissioned server is **always** left **without any data**.

We have **exactly** the same approach for any SAN or storage volumes.
{% endhint %}

### Container runtime security monitoring

To ensure that our customers’ instances are not compromised while running, and to detect any suspicious behavior that could lead to a security issue, Toucan uses [Falco](https://falco.org/) on its infrastructure to monitor running containers and hosts in real time.

Whenever [Falco](https://falco.org/) detects a scenario that is not on the Toucan team’s whitelist, it sends an alert to the team.

Here are some examples of suspicious activities:

* RCE (Remote Code Execution) inside a container.
* Package installation during the runtime of a container.
* Shell binding to a suspicious file descriptor.
* A Netcat program running inside a container in a way that allows remote code execution.
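
The whitelist logic described above can be sketched as a simple filter. This is a minimal illustration that assumes each detected event is a dictionary with a `rule` key; the rule names and the function name are hypothetical, and this is not Falco's API or Toucan's actual alerting code.

```python
# Hypothetical allowlist of rule names the team considers expected.
ALLOWED_RULES = {"Expected maintenance job"}


def events_to_alert(events, allowed_rules=ALLOWED_RULES):
    """Return the events whose rule is not on the allowlist.

    Events without a "rule" key are treated as not allowed, so they are
    reported rather than silently dropped.
    """
    return [event for event in events if event.get("rule") not in allowed_rules]


if __name__ == "__main__":
    sample_events = [
        {"rule": "Expected maintenance job"},
        {"rule": "Package installed at runtime"},
    ]
    for event in events_to_alert(sample_events):
        print("ALERT:", event["rule"])  # prints: ALERT: Package installed at runtime
```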


