Confluence clustering for scalability
To accommodate 50,000 users and more, a two-node Confluence cluster is required.
Assumptions
The following assumptions are made:
- Confluence will be used in a private intranet (non-public instance).
- A Linux operating system can be used.
- A MySQL database can be used.
- An Apache webserver can be used.
- Microsoft Active Directory is used as centralized user management.
- Clustering is for scalability only, not HA (though the model provides some level of redundancy).
Topics
Building a Confluence cluster involves the topics covered in the following sections.
Cluster topology
We suggest the topology described below for a two-node Confluence cluster.
General (Hardware) requirements
- 2 dedicated, physical machines for the Confluence nodes
- 1 dedicated, physical machine for the MySQL Master database
- 1 dedicated machine for the MySQL Slave database
- 1 dedicated, physical machine for Load Balancing
- The hardware and software of both Confluence nodes need to be exactly the same
- Suggested Hardware for both Confluence Nodes:
- 2x quad-core CPU >= 2.5 GHz or 1x 8-core CPU >= 2.5 GHz
- 20 GB RAM
- Fast disk (e.g. SSD) for the search index (32-64 GB)
- Suggested Hardware for the staging server:
- 2x quad-core CPU >= 2.5 GHz or 1x 8-core CPU >= 2.5 GHz
- 32 GB RAM
- Large HDD (500-1000GB) RAID10
- Suggested Hardware for Master database:
- 2x 8-core CPU >= 2.5 GHz
- 32GB RAM
- Fast + Large HDD (500-1000GB) RAID10
- Suggested Hardware for Slave database:
- 1x 8-core CPU >= 2.5 GHz
- 16GB RAM
- Fast + Large HDD (500-1000GB) RAID10
- Suggested Hardware for the Load Balancer:
- 1x quad-core CPU >= 2.5 GHz
- 8GB RAM
Cluster concept
The cluster consists of two dedicated, physical Confluence nodes (node 1 and node 2), a separate staging server, a dedicated, physical MySQL master database together with a slave database, and an Apache2 webserver with mod_jk for load balancing. Between the two Confluence nodes there is a Gigabit network connection (direct or separated via VLAN) for cluster synchronization. The two databases are linked by master/slave replication. All substantial data, including attachments, is stored in the database and replicated to the second MySQL server. Active Directory is used as central user management, including groups.
The second Confluence node and the staging server will be clones of the first node. This makes setup faster and easier and ensures that configuration, software and versions are all the same on all systems. The same approach is used for the MySQL master and slave. (Note that machine-specific configuration such as networking, hostname, Confluence cluster configuration and the MySQL master/slave configuration is still required.)
LVM (Logical Volume Manager) will be used on all systems as a layer between the disk drive and the file system to allow snapshots (see below for their purpose). The partition layout should be as follows (a minimal LVM sketch follows the list):
Root partition - /
Confluence install - /opt
Confluence home - /var/opt/confluence
MySQL - /var/lib/mysql
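A minimal sketch of such an LVM layout, assuming a single volume group named 'vg0'; the volume names, sizes and the ext4 file system are examples, not requirements:
# Create one logical volume per mount point listed above; leave free space
# in the volume group so that snapshots can be created later.
lvcreate -n root       -L 20G  vg0   # mounted at /
lvcreate -n confluence -L 20G  vg0   # mounted at /opt
lvcreate -n confhome   -L 64G  vg0   # mounted at /var/opt/confluence
lvcreate -n mysql      -L 500G vg0   # mounted at /var/lib/mysql (database machines only)
mkfs.ext4 /dev/vg0/root              # create a file system on each volume accordingly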
Backup
As all relevant data, including attachments, is stored in the database, it is sufficient to back up the database. Changes to the installation files can be tracked and backed up in a repository. The Confluence-internal backup is not used. Because of the amount of data that can be expected (attachments require a lot of space in the database), it is best for performance and recovery to back up the raw MySQL files. As all data is replicated to the slave database, it is easy to back up the raw files consistently, because the MySQL slave can safely be stopped and started as desired and will catch up automatically. Note, however, that catching up is only possible within a certain window of time, depending on how much has happened (and how much binary log the master still retains) since the slave went out of sync.
In this case we would suggest the following backup strategy (a minimal sketch follows the list):
- Shut down MySQL on the slave.
- Back up the raw files with whatever backup tooling is preferred.
- Start MySQL on the slave to sync up with the MySQL master.
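A minimal sketch of such a backup run on the slave; the init script name, backup target and the use of tar are assumptions, any backup tool will do:
/etc/init.d/mysql stop                                        # stop MySQL on the slave
tar czf /backup/mysql-raw-$(date +%F).tar.gz /var/lib/mysql   # back up the raw files
/etc/init.d/mysql start                                       # slave reconnects and catches up with the master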
This strategy, however, does not support point-in-time recovery. If this is required, we suggest using software like Zmanda Recovery Manager, which allows restoring MySQL data from a given point in time. It needs to be installed on the MySQL master with binary logs enabled (which they are anyway, as they are needed for replication). It provides a simple web GUI for setting up backups and allows easy restores to a given point in time.
Furthermore, because of the master/slave architecture, there is already a live backup which can instantly be used in case either the master or the slave suffers a hardware failure. This also adds some redundancy to the cluster topology.
Lastly, we have found that, as long as no hardware failure happens, we rarely need to recover from a backup. The trash functionality as well as the revisioning of pages is almost always sufficient to recover 'accidentally' deleted pages.
Security
The following security measures are applied:
- SSL encryption for both HTTP and LDAP traffic.
- Confluence is configured to run as an unprivileged user.
- Secure Tomcat behind an Apache2 webserver; in this setup it is secured via the load balancer, and all traffic to the two nodes is routed through the load balancer.
- Completely disable the remote API (if it isn't needed) or restrict access to the API to certain IPs or via BasicAuth in the Apache2 load balancer (if this is desired by the customer).
- Restrict or secure access to the admin console in the load balancer as stated above (if this is desired by the customer).
- Configure the LDAP filter to exclude users which have the 'disabled' flag set.
- Disable public signup.
- Make sure XSS / CSRF protection is on.
If the instance is facing the internet (this case was excluded by the assumptions above, but we feel it is interesting nonetheless):
- Fail2ban can be set up for protection against brute-force login attacks.
- On country-specific instances, the iptables module 'geoip' can be used to block certain countries from accessing Confluence (and the server) at all; a sketch follows this list.
- Disable the People Directory.
- Enable CAPTCHA for failed logins and spam protection.
- Configure e-mail address visibility.
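As an illustration of the geoip blocking mentioned above, a hypothetical rule using the xtables-addons 'geoip' match; the country code and port are examples, and the module plus its country database must be installed separately:
# Drop HTTPS traffic that does not originate from Germany (example country code).
iptables -I INPUT -p tcp --dport 443 -m geoip ! --src-cc DE -j DROP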
Furthermore we recommend the following guidelines to our customers:
- The 'confluence-administrators' group should contain only users who definitely require full administration access. Ideally, all administrators have a separate administrator login (e.g. user account is 'avrienen', then admin account is 'avrienen-admin'). The same applies to the 'System Administrator' permission.
- Regularly perform audits: check and clean up global, space and page permissions. Remove permissions that are not required. The same applies to group memberships.
- Subscribe to the 'Technical Alerts' mailing list (also, //SEIBERT/MEDIA monitors the mailing list and informs customers if they are affected by a certain security bug).
- Establish some sort of 'policy' for plugins in Confluence. Because Confluence is a fast-evolving product, not all plugins keep up with every Confluence release. This can be a problem, for example when administrators want to update to a newer release but cannot because a certain plugin is not compatible. When customers ask us which plugins can be installed, our rule of thumb is to only install plugins which are officially supported by Atlassian, OR are commercially available (with support, that is), OR have been around in the Plugin Exchange for some time and are known to be actively developed (we have good experience with plugins from Adaptavist and CustomWare). This is not strictly a security concern, but important nonetheless.
Monitoring
Monitoring the cluster topology is vital and consists of several points that can be divided into availability, performance and error monitoring.
Monitoring for availability
Monitoring for availability includes periodic checks (say every 1-2 minutes) of:
- Processes: Are certain processes like tomcat, apache, mysql alive and running (e.g. not in zombie or stale mode)?
- Network: Are all cluster components online (ping check)? What is the latency? Are all network interfaces up?
- HTTP: Check for expected response patterns and HTTP response codes.
Monitoring for availability is vital in (growing) larger installations because prolonged unavailability can be fatal for adoption and impact employee productivity.
The most established solution for availability monitoring is Nagios. We suggest using Nagios or its fork Icinga; a few example checks are sketched below.
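As an illustration, a few hypothetical checks using the standard Nagios plugins, shown here as direct command-line invocations (plugin path, host names and thresholds are examples and would normally be wired into Nagios service definitions):
/usr/lib/nagios/plugins/check_ping -H node1 -w 100,20% -c 500,60%                    # reachability and latency
/usr/lib/nagios/plugins/check_procs -C mysqld -c 1:                                  # is the MySQL process running?
/usr/lib/nagios/plugins/check_http -H wiki.example.com -u /login.action -e "200"     # HTTP response code check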
Monitoring for performance
- Load Balancer Level: Logging access and the number of requests/sec., Number of Apache processes and threads, Apache Traffic Volume.
- System Level Monitoring: This applies to all machines in the cluster topology. This monitoring is required to see shortcomings and problems on system Level. Checks involve: CPU Load metrics, Memory usage, Swap usage, I/O throughput, Network traffic, Disk usage, Inode usage, Number of active network connections, Number and types of interrupts, Average load.
- Database monitoring: number of queries per second, number of slow queries per second, data throughput, insert/select ratio, number of threads.
- Confluence/Tomcat monitoring: This special monitoring is crucial to see problems and the effects of tuning on the application level. Important metrics: maximum heap memory, heap memory used, committed heap memory, thread count, peak thread count, loaded classes, unloaded classes, committed and used memory of Eden space, OldGen, Survivor space and PermGen, page faults, number of errors, garbage collection statistics, number of JDBC connections.
The most important checks are simple ones: CPU usage, memory usage, Java heap usage, number of database connections/queries and network latency, because these metrics are often the first ones that help identify problems.
We suggest mainly using Munin for automated performance monitoring, and JavaMelody or Hyperic for Confluence/Tomcat monitoring. Simple, manual checks can also be done through the Confluence admin console, which includes statistics on min/max heap size, heap size used, database latency, cache statistics and more.
Monitoring for errors
Monitoring for errors mainly involves watching the Confluence logs for Java application errors. Recent versions of Confluence provide 'Hercules' for manual error checking. A simple, automated solution for watching for fatal errors in Confluence's logs could be a combination of Unix tools like 'tail', 'awk' and 'mail' to send emails when problems arise.
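A minimal sketch of such a combination; the log path, mail subject and address are examples, assuming the Confluence home from the partition layout above:
tail -F /var/opt/confluence/logs/atlassian-confluence.log | \
awk '/FATAL|ERROR/ {
  cmd = "mail -s \"Confluence error on node1\" wiki-admin@example.com"
  print $0 | cmd
  close(cmd)       # flush so each matching line is sent immediately
}'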
Tuning
In conjunction with the general requirements and cluster concept mentioned above, the following general 'best practices' are applied to ensure best performance:
- Using the latest version of Java 6 JDK 64-Bit.
- Using the latest version of Confluence.
- Using the latest version of MySQL database connector.
- Allocate enough memory to Confluence and avoid swapping at all costs. As a rule of thumb, we allocate ~85% of RAM to Confluence on a Linux machine (on a Windows Server 2008 machine it is less, around 70%); see the sketch after this list.
- Avoid virus scanners or exclude them from the Confluence home directory.
- Provide a Gigabit network if possible (at least within the cluster topology).
- Locate the Confluence search index on a fast disk (SSD).
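A minimal sketch of the memory rule of thumb for the 20 GB Confluence nodes suggested above (20 GB x 0.85 is roughly 17 GB), assuming the memory settings live in Tomcat's bin/setenv.sh; the exact variable and file may differ per installation:
# Give the Confluence JVM a fixed heap of roughly 85% of the machine's RAM.
JAVA_OPTS="-server -Xms17g -Xmx17g $JAVA_OPTS"
export JAVA_OPTS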
Tuning Confluence and all other services and systems is a constant process during the life-cycle of Confluence, because sizes, use-cases, software/architecture all change through time and require adjustments. Nonetheless, the following general tuning measures have proven to be useful and thus are applied to our cluster as well:
- Increase the connection pool in Confluence and the maximum number of connections on the database side (see the sketch after this list).
- Use an optimized MySQL configuration.
- Run Tomcat behind an Apache2 webserver (the load balancer) and enable mod_deflate for compression of JS/CSS/HTML. This reduces bandwidth and is mostly relevant for public-facing instances.
- The Apache2 load balancer is configured to use the thread-based MPM 'worker' with increased values for maximum connections.
- Configure a database query timeout (e.g. 30 seconds) to prevent Confluence from becoming unresponsive on long queries.
- Tune the Java Virtual Machine. We found the following parameters to increase performance significantly in most cases:
# Optimizations from the Java Team
-XX:+AggressiveOpts
# Old parallel GC
-XX:+UseParallelOldGC
-XX:+DisableExplicitGC
# This feature is enabled by default since JDK 6 Update 21
-XX:+UseCompressedOops
# Compiler escape analysis; this feature is enabled by default since JDK 6 Update 21
-XX:+DoEscapeAnalysis
# String operation optimizations (if JDK 6 Update 20 or later is installed)
-XX:+OptimizeStringConcat
# If on an Intel CPU with 'QuickPath Interconnect' or an AMD Opteron, this should be enabled
-XX:+UseNUMA
- The parameter '-server' is added to the Java options (as a side note: on most Linux server installations this is the default, but on many Windows installations it is not).
- Apart from those optimizations, tuning a productive cluster is trial and error. Because we set up monitoring in a previous step, we can easily verify if any of our tuning has improved performance or not.
- We are currently evaluating the advantages and possibilities of MariaDB as a MySQL replacement. This could be incorporated into this project if the customer desires it and would most likely provide better database performance.
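A minimal sketch for the connection limits mentioned at the top of the list above (the value is an example): raise the limit on the MySQL side at runtime and persist it via max_connections in my.cnf; the Confluence-side pool size is configured via hibernate.c3p0.max_size in confluence.cfg.xml and requires a restart.
# Raise the MySQL connection limit at runtime (persist the change in my.cnf).
mysql -u root -p -e "SET GLOBAL max_connections = 600;"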
Staging / Testing
Before doing any kind of upgrade, for example updating system packages or installing a new version of Confluence, it must be tested in a staging environment.
As stated in the requirements above, a separate staging machine is required. A Developer license is also mandatory.
The staging server combines a single-node Confluence cluster installation with a MySQL database, so that it closely resembles both one of the productive Confluence nodes and the database. The staging server, however, is primarily suited for system upgrade tests and Confluence upgrade tests (see below for the steps of such an upgrade test). For performance and tuning tests, we suggest doing them in the productive environment within a specified time frame.
Steps for a system (packages) upgrade test:
- Switch into single-user mode.
- Create an LVM snapshot of the root partition (see the sketch after this list).
- Reboot into the snapshot (select the snapshot from the bootloader, which keeps the selection temporary).
- Apply the upgrade/updates and see if everything is working as expected.
- If the test was successful, reboot back into the regular system and redo the upgrade/updates there.
- Delete the snapshot.
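A minimal sketch of the snapshot handling around such a test, assuming the volume group 'vg0' and volume names from the partitioning sketch above; the snapshot size is an example and only needs to hold the changes written during the test:
lvcreate --snapshot -n root_snap -L 10G /dev/vg0/root   # create the snapshot of the root volume
lvs                                                     # verify the snapshot and watch its fill level
# ... boot into the snapshot and run the upgrade test ...
lvremove /dev/vg0/root_snap                             # delete the snapshot once the test is done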
Steps for a Confluence upgrade test:
- Stop Confluence and MySQL on the staging server, if not already stopped.
- Stop MySQL Slave.
- Stop one of the productive Confluence nodes.
- Copy the contents of the home directory from the stopped node to the staging server.
- Copy the MySQL RAW files from the stopped slave to the staging server.
- Start MySQL Slave.
- Start the stopped Confluence node.
- Start MySQL on the staging server with the copied RAW files.
- Adjust the Confluence cluster configuration to match the staging server's own configuration.
- Deploy new Confluence version.
- Reapply customizations of the Confluence installation (this can be automated with a VCS, e.g. git).
- Start Confluence and watch the upgrade process (log files).
- Run your checks: automated tests, manual tests. Check for plugin compatibility and verify that none are broken. It is also recommended to do a UAT with a few departments and selected users.
Upgrades
If everything is working in the staging environment, the upgrade can be done. During the whole process, log files should be monitored, especially the Confluence logs during possible database upgrades.
Steps for a Confluence upgrade:
- Pick a time frame and communicate it.
- Disable load balancing for both nodes.
- Enable a maintenance page.
- Shut down Confluence on both nodes.
- Shut down the MySQL slave, then the MySQL master (make sure that master and slave are in sync, all data is written and there are no open connections).
- Create an LVM snapshot of the Confluence home and install directory on both nodes.
- Create an LVM snapshot of the MySQL database on the master.
- Start the MySQL master.
- Deploy the new Confluence Standalone on both nodes.
- Reapply customizations of the Confluence installation on the first node (this can be automated with a VCS, e.g. git).
- Start Confluence on one node.
Now the first node should be running on the new version. If everything works fine, the following steps can be applied:
- Reapply customizations of the Confluence installation on the second node (this can be automated with a VCS, e.g. git).
- Start Confluence on the second node.
- Start MySQL Slave.
- Disable the maintenance page.
- Re-enable the load balancer.
- Upgrade installed plugins.
- Delete the created snapshots.
- Communicate that everything went smoothly.
Working with snapshots allows fast recovery if anything goes wrong during the upgrade. Recovery steps:
- Shutdown failed Confluence node.
- Shutdown MySQL master.
- Revert to the Confluence install and home directory snapshots created earlier (in LVM terms, this is 'merging'; see the sketch after this list).
- Revert to the MySQL snapshot created.
- Start MySQL.
- Start the failed node.
- Analyze and resolve the issue outside of production.
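A minimal sketch of the revert, assuming snapshot names like those used above; note that merging into a mounted origin volume is deferred by LVM until the volume is next activated, e.g. after a reboot:
lvconvert --merge /dev/vg0/confhome_snap     # roll the Confluence home directory back
lvconvert --merge /dev/vg0/confluence_snap   # roll the Confluence install directory back
lvconvert --merge /dev/vg0/mysql_snap        # on the MySQL master: roll the database back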
We recommend that our customers always make one change at a time and observe (for a longer period, say 2-3 days) what happens. We do not recommend making multiple changes at the same time (like a system AND a Confluence upgrade). We also recommend tracking every change and suggest establishing an upgrade checklist.
As to the question of when to upgrade and to which version (a common question from our customers): we generally suggest NOT upgrading immediately to the first available version of a new major release (e.g. from 3.4.x to 3.5.0). Instead, we suggest waiting for the first or second bug-fix release of that major version (e.g. 3.5.2 instead of 3.5.0) and watching the knowledge base and bug tracker for open and resolved issues.