IT Crash Course Day 5 (Bonus): Service Desk, Task Automation, and Monitoring

Ticketing Systems

Fundamentals of Ticketing Systems

A ticketing system (also known as a help desk or service desk system) is a software application that manages and maintains lists of issues reported by users of an organization. It's an essential tool for IT support teams to track, prioritize, and resolve user requests efficiently.

Core Functions of Ticketing Systems:

  1. Issue Tracking: Capture and document user problems in a structured format
  2. Workflow Management: Route tickets to appropriate staff and track progress
  3. Communication: Facilitate communication between users and support staff
  4. Knowledge Management: Build a repository of solutions to common problems
  5. Reporting: Generate metrics on team performance and common issues

Benefits of Using Ticketing Systems:

  • Organization: Prevents issues from being forgotten or overlooked
  • Accountability: Clearly shows who is responsible for each ticket
  • Efficiency: Reduces duplicate work and enables proper resource allocation
  • Documentation: Creates a searchable history of issues and resolutions
  • Metrics: Provides data for continuous improvement
  • Consistency: Ensures standardized support processes

[Figure: Key components of a ticketing system]

Ticket Lifecycle

Understanding the typical lifecycle of a support ticket is essential for effective help desk operations:

1. Creation

  • User submits ticket via email, web form, phone call, chat, or self-service portal
  • System automatically assigns a unique ticket ID
  • Initial metadata is captured (submitter, timestamp, category)

2. Assignment

  • Ticket is assigned to an individual or team based on:
    • Category/type of issue
    • Technical skills required
    • Availability of staff
    • Load balancing
  • Assignment can be manual or automated based on rules

3. Acknowledgment

  • Confirmation sent to user that ticket has been received
  • Initial response often automated but may be personalized
  • Sets expectations for next steps and timeframe

4. Working/In Progress

  • Support staff investigates the issue
  • Updates are added to the ticket
  • Status changes reflect current activity
  • Reassignments may occur if escalation is needed

5. Resolution

  • Solution is implemented
  • Resolution details are documented
  • User may be asked to verify the solution

6. Closure

  • Ticket is marked as resolved/closed
  • Feedback may be requested from the user
  • Knowledge base may be updated with solution
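
The load-balancing rule mentioned under Assignment can be sketched in a few lines of shell. This is only an illustration under stated assumptions, not how any particular ticketing product assigns work: the function name `least_loaded_agent`, the `name:count` argument format, and the agent names are all hypothetical.

```shell
#!/bin/bash
# Hypothetical sketch: choose the agent with the fewest open tickets.
# Each argument is a "name:open_ticket_count" pair (an assumed format).
least_loaded_agent() {
    local best_name="" best_count="" entry name count
    for entry in "$@"; do
        name=${entry%%:*}     # text before the first colon
        count=${entry##*:}    # text after the last colon
        if [ -z "$best_count" ] || [ "$count" -lt "$best_count" ]; then
            best_name=$name
            best_count=$count
        fi
    done
    echo "$best_name"
}

least_loaded_agent "alice:5" "bob:2" "carol:7"   # prints: bob
```

When two agents have the same count, the first one listed wins. A real system would usually also weigh skills and availability, as the list under Assignment notes.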

Common Ticket Statuses:

| Status | Description |
| --- | --- |
| New/Open | Ticket has been created but not yet assigned or acknowledged |
| Assigned | Ticket has been allocated to a support staff member |
| In Progress | Work on resolving the issue has begun |
| Pending | Awaiting information from user or third party |
| Resolved | Issue has been fixed but awaiting confirmation |
| Closed | Issue is confirmed fixed and ticket is complete |
| Reopened | Previously closed ticket that needs further attention |
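
One way to picture how these statuses relate is as a small state machine that rejects moves that skip steps. The sketch below is an assumption-laden illustration: the source lists the statuses but not the allowed moves between them, so the transition list and the function name `can_transition` are hypothetical.

```shell
#!/bin/bash
# Hypothetical sketch: returns 0 if a ticket may move from status $1 to status $2.
# The allowed transitions are assumptions chosen for illustration.
can_transition() {
    case "$1 -> $2" in
        "New -> Assigned")            return 0 ;;
        "Assigned -> In Progress")    return 0 ;;
        "In Progress -> Pending")     return 0 ;;
        "Pending -> In Progress")     return 0 ;;
        "In Progress -> Assigned")    return 0 ;;   # reassignment or escalation
        "In Progress -> Resolved")    return 0 ;;
        "Resolved -> In Progress")    return 0 ;;   # user reports the fix did not work
        "Resolved -> Closed")         return 0 ;;
        "Closed -> Reopened")         return 0 ;;
        "Reopened -> Assigned")       return 0 ;;
        *)                            return 1 ;;
    esac
}

for target in "Assigned" "Closed"; do
    if can_transition "New" "$target"; then
        echo "New -> $target: allowed"
    else
        echo "New -> $target: rejected"
    fi
done
```

Most ticketing products let administrators configure such rules, so the exact set differs between organizations.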

[Figure: Ticket lifecycle]

SLAs and Prioritization

Service Level Agreements (SLAs) are commitments between a service provider and the client about the expected level of service. In IT support, SLAs typically define response and resolution times.

Key SLA Components:

  1. Response Time: How quickly support acknowledges the ticket
  2. Resolution Time: How quickly the issue must be resolved
  3. Service Hours: When support is available (e.g., 24/7 or 9-5)
  4. Escalation Process: Steps taken if SLAs are not met

Ticket Priority Levels:

Most ticketing systems use 3-5 priority levels that determine how quickly a ticket needs attention:

| Priority | Description | Typical Response Time | Example |
| --- | --- | --- | --- |
| Critical/P1 | Service outage affecting multiple users or critical business function | 15-30 minutes | Email system down for entire company |
| High/P2 | Significant impact to business operations | 1-2 hours | Department file share inaccessible |
| Medium/P3 | Limited impact, workaround available | 4-8 hours | Printer not working, alternative available |
| Low/P4 | Minor issue, minimal impact | 1-2 days | Feature request or non-urgent question |

Factors in Priority Determination:

  • Number of users affected
  • Business impact
  • Availability of workarounds
  • Time sensitivity
  • VIP status of affected users
  • Contractual obligations

SLA Calculations:

  • Response Time: Time between ticket creation and first staff response
  • Resolution Time: Time between ticket creation and resolution
  • SLA Compliance Rate: Percentage of tickets resolved within SLA targets
  • Breach Time: When a ticket exceeds its SLA target
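
The calculations above come down to simple arithmetic on timestamps. The following is a minimal sketch under stated assumptions: the function names and sample timestamps are hypothetical, times are Unix epoch seconds, and elapsed time is measured on the wall clock, so service hours are ignored.

```shell
#!/bin/bash
# Hypothetical helpers for the SLA calculations; names and sample values are illustrative.

# Whole minutes elapsed between two Unix epoch timestamps (in seconds).
elapsed_minutes() {
    echo $(( ($2 - $1) / 60 ))
}

# Prints "met" if the elapsed minutes ($1) are within the target ($2), otherwise "breached".
sla_status() {
    if [ "$1" -le "$2" ]; then echo "met"; else echo "breached"; fi
}

# Percentage of tickets resolved within target (integer, rounded down).
# $1 = tickets within SLA, $2 = total tickets.
compliance_rate() {
    if [ "$2" -eq 0 ]; then echo 0; return; fi
    echo $(( $1 * 100 / $2 ))
}

created=1700000000           # ticket created
first_response=1700000600    # first staff response, 10 minutes later
resolved=1700018000          # resolved 300 minutes after creation

response_min=$(elapsed_minutes "$created" "$first_response")
resolution_min=$(elapsed_minutes "$created" "$resolved")
echo "Response: ${response_min} min ($(sla_status "$response_min" 15))"
echo "Resolution: ${resolution_min} min ($(sla_status "$resolution_min" 240))"
echo "Compliance: $(compliance_rate 47 50)%"
```

With the sample values, the response target of 15 minutes is met, the 240-minute resolution target is breached, and 47 of 50 tickets within target gives a 94% compliance rate.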

Example SLA Targets:

Priority 1 (Critical):
- Response: 15 minutes
- Resolution: 4 hours
- Available: 24x7

Priority 2 (High):
- Response: 1 hour
- Resolution: 8 hours
- Available: Business hours

Priority 3 (Medium):
- Response: 4 hours
- Resolution: 24 hours
- Available: Business hours

Priority 4 (Low):
- Response: 8 hours
- Resolution: 72 hours
- Available: Business hours
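
As a rough sketch, the example targets above can be turned into a lookup that yields a resolution deadline for a new ticket. The function names are hypothetical, and the deadline is computed in wall-clock time, which only matches the 24x7 coverage of Priority 1; a business-hours calendar for the other priorities is deliberately left out.

```shell
#!/bin/bash
# Hypothetical sketch using the example SLA targets listed above.

# Prints "<response_minutes> <resolution_minutes>" for priority 1-4.
sla_targets() {
    case "$1" in
        1) echo "15 240" ;;      # 15 minutes / 4 hours
        2) echo "60 480" ;;      # 1 hour / 8 hours
        3) echo "240 1440" ;;    # 4 hours / 24 hours
        4) echo "480 4320" ;;    # 8 hours / 72 hours
        *) echo "unknown priority: $1" >&2; return 1 ;;
    esac
}

# Prints the resolution deadline as a Unix epoch timestamp.
# $1 = ticket creation time (epoch seconds), $2 = priority (1-4).
# Wall-clock arithmetic only: business hours are NOT taken into account.
resolution_deadline() {
    local created_epoch="$1" targets resolution_min
    targets=$(sla_targets "$2") || return 1
    resolution_min=${targets#* }    # keep the second field
    echo $(( created_epoch + resolution_min * 60 ))
}

resolution_deadline 1700000000 1   # prints: 1700014400
```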

Communication Within Tickets

Effective communication within tickets is crucial for efficient resolution and user satisfaction.

Types of Ticket Comments:

  1. Public Comments/Notes

    • Visible to both support staff and end users
    • Used for:
      • Requesting additional information
      • Providing status updates
      • Communicating solutions
      • Asking verification questions
    • Should be professional and jargon-free
  2. Private/Internal Comments

    • Visible only to support staff
    • Used for:
      • Technical notes and troubleshooting steps
      • Internal discussion about the issue
      • Documenting attempted solutions
      • Notes about user interaction
      • Recording sensitive information

Best Practices for Ticket Communication:

  • Be Clear and Concise: Use simple language and avoid technical jargon
  • Document Everything: Record all troubleshooting steps and outcomes
  • Use Templates: Create standard responses for common situations
  • Include Next Steps: Clearly state what happens next and expectations
  • Provide Time Estimates: When possible, give timeframes for resolution
  • Update Regularly: Keep users informed, especially for lengthy issues
  • Use Proper Tone: Remain professional, empathetic, and solution-focused

Common Communication Fields in Tickets:

  • Subject/Title: Concise description of the issue
  • Description: Detailed explanation of the problem
  • Comments/Notes: Ongoing communication thread
  • Resolution Notes: Documentation of the final solution
  • Close Comments: Final notes when closing the ticket

[Figure: Ticket communication]

Zoho Desk Basics

Zoho Desk is cloud-based help desk software that Zoho markets as "context-aware", meaning it shows agents customer history and related information alongside each ticket. Below is a basic overview of its key features and functionality.

[Figure: Zoho Desk interface overview (dashboard)]

Key Features:

  1. Multi-Channel Support

    • Email, phone, chat, social media, and web form integration
    • Automatic ticket creation from any channel
  2. Ticket Management

    • Custom ticket views and filters
    • Automated ticket assignment and routing
    • Customizable workflows and business rules
    • SLA management with escalations
  3. Knowledge Base

    • Create and publish articles
    • Suggest relevant articles based on ticket context
    • Multi-language support
  4. Reporting and Analytics

    • Pre-built and custom reports
    • Performance metrics dashboards
    • Team productivity analysis

Zoho Desk Ticket Workflow:

  1. Ticket Creation

    • Tickets are created through various channels
    • Automatic categorization based on keywords or rules
  2. Assignment and Prioritization

    • Automatic or manual assignment to agents
    • Priority setting based on configurable rules
  3. Working the Ticket

    • Agents update status as they work
    • Internal and external communication options
    • Time tracking for work performed
  4. Resolution and Closure

    • Solution documentation
    • Customer satisfaction surveys
    • Knowledge base updates

Unique Zoho Desk Features:

  • Zia: AI assistant that suggests solutions and detects sentiment
  • Team Supervision: Monitor agent performance in real-time
  • Blueprints: Visual process builders for ticket workflows
  • Extensive Integration: Works with Zoho CRM, Analytics, and third-party apps

Knowledge Check: Ticketing Systems

  1. What is the primary purpose of a ticketing system?
    a) To record employee work hours
    b) To track and manage user issues and support requests
    c) To monitor server performance
    d) To schedule maintenance windows

  2. What does SLA stand for in IT support?
    a) System Level Architecture
    b) Service Level Agreement
    c) Software License Activation
    d) Support Liaison Assignment

  3. Which of the following would typically be classified as a Priority 1 (Critical) ticket?
    a) A request for a new software feature
    b) A single user unable to print
    c) Email system down for the entire organization
    d) A question about how to use a specific application

  4. What is the difference between public and private comments in a ticketing system?
    a) Public comments are visible to all staff, while private comments are only visible to managers
    b) Public comments are visible to the end user, while private comments are only visible to support staff
    c) Public comments are automatically posted to social media, while private comments remain internal
    d) Public comments are included in reports, while private comments are excluded

  5. Which of the following is NOT typically a standard ticket status in help desk systems?
    a) In Progress
    b) Pending
    c) Transferred
    d) Profitable

Task Scheduler

Task Scheduling Fundamentals

Task scheduling is the process of automating repetitive tasks to run at predetermined times without manual intervention. This is a crucial aspect of IT operations that improves efficiency and ensures consistency in routine processes.

Core Concepts of Task Scheduling:

  1. Automation: Reducing manual effort for repetitive tasks
  2. Timing Control: Executing tasks at optimal times (off-hours, specific intervals)
  3. Dependency Management: Ensuring tasks run in proper sequence
  4. Resource Optimization: Distributing workload to minimize impact

Benefits of Task Scheduling:

  • Efficiency: Frees up IT staff for more valuable work
  • Consistency: Eliminates human error in routine tasks
  • Reliability: Ensures critical tasks are never forgotten
  • Off-hours Execution: Allows resource-intensive tasks to run during quiet periods
  • Monitoring: Creates logs for verification and troubleshooting

Common Scheduled Task Categories:

  1. Maintenance Tasks

    • Database backups
    • Log rotation and archiving
    • Disk cleanup and defragmentation
    • Temporary file deletion
  2. System Administration

    • System updates and patching
    • Service restarts
    • Health checks and monitoring
    • Report generation
  3. Business Processes

    • Data imports/exports
    • Batch processing
    • File synchronization
    • Automated emails
  4. Security Operations

    • Vulnerability scans
    • Antivirus updates
    • User access reviews
    • Security log analysis

Key Components of Scheduled Tasks:

  • Trigger: When the task should run (time, event, etc.)
  • Action: What the task should do (run program, script, etc.)
  • Conditions: Requirements that must be met to run
  • Settings: Additional options (retry logic, timeout, etc.)
  • Security Context: User account under which task runs
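
Retry logic, listed under Settings above, is usually provided by the scheduler itself, but the idea can also be sketched inside a script. The `retry` helper below is hypothetical (as is the commented-out backup script path): it reruns a command a fixed number of times with a fixed delay between attempts.

```shell
#!/bin/bash
# Hypothetical helper: run a command up to $1 times, sleeping $2 seconds between attempts.
# Remaining arguments are the command and its arguments.
retry() {
    local max_attempts="$1" delay_seconds="$2" attempt=1
    shift 2
    while true; do
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "Giving up after $attempt attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay_seconds"
    done
}

# Example (hypothetical script path): try a backup up to 3 times, 60 seconds apart.
# retry 3 60 /usr/local/bin/db_backup.sh
```

A fixed delay keeps the sketch short; schedulers often add a timeout as well, so that a hung attempt does not block the next run.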

[Figure: Task scheduling concept]

Windows Task Scheduler

Windows Task Scheduler is a built-in tool that allows users to automate tasks on a Windows computer. Because it ships with every Windows installation, it is available to every IT support specialist and provides a solid foundation for understanding task automation concepts.

Accessing Windows Task Scheduler:

  • Search "Task Scheduler" in the Start menu
  • Run taskschd.msc from Run dialog or Command Prompt
  • Access via Control Panel > Administrative Tools > Task Scheduler (the folder is named Windows Tools in Windows 11)

Creating Basic Scheduled Tasks:

  1. Open Task Scheduler
  2. Click "Create Basic Task" in the right panel
  3. Enter name and description
  4. Select trigger type (daily, weekly, one-time, etc.)
  5. Configure trigger details (time, days, etc.)
  6. Select action (start a program; the older "send an e-mail" and "display a message" actions are deprecated in current Windows versions)
  7. Configure action details (program path, arguments)
  8. Review settings and finish

[Figure: Windows Task Scheduler]

Advanced Task Properties:

  • Triggers: Multiple triggers possible
  • Actions: Multiple actions possible
  • Conditions: Additional requirements (idle time, power state)
  • Settings: Execution limits, retry logic, task deletion
  • History: View execution history (success/failure)
  • Security: Run as specific user, highest privileges

Common Task Scheduler Use Cases:

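  1. System Maintenance:

    REM Disk Cleanup (run "cleanmgr /sageset:1" once beforehand to choose what profile 1 cleans)
    cleanmgr /sagerun:1

    REM Check Disk (on the system drive this asks to schedule the check for the next reboot)
    chkdsk C: /f /r
  2. Backup Scripts:

    REM Simple File Backup (/MIR mirrors the source, so files deleted there are also deleted from the backup)
    robocopy C:\ImportantData D:\Backup\Data /MIR /R:3 /W:10
  3. Network Operations:

    REM Network Connectivity Check
    ping -n 1 google.com > C:\logs\pingtest.log
  4. Application Management:

    REM Restart Application Service
    net stop "Application Service" && net start "Application Service"
  5. Report Generation:

    REM Export Database Query to CSV (-s"," sets the column separator, -W trims padding)
    sqlcmd -S server -d database -E -Q "SET NOCOUNT ON; SELECT * FROM table" -s"," -W -o C:\Reports\daily.csv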

Dollar Universe Overview

Dollar Universe is an enterprise-class job scheduling and workload automation platform used by large organizations to manage complex batch processing across multiple systems and applications.

Key Features of Dollar Universe:

  1. Cross-Platform Support:

    • Windows, Unix, Linux, IBM i, z/OS
    • Support for major ERP and database systems
  2. Centralized Management:

    • Single console to manage jobs across all environments
    • Visual workflow designer
  3. Advanced Scheduling:

    • Calendar-based scheduling
    • Event-based triggers
    • Complex dependencies
    • Conditional execution
  4. Enterprise Features:

    • Workload forecasting
    • Resource management
    • SLA monitoring
    • Disaster recovery

Dollar Universe Architecture:

Dollar Universe divides its environment into four areas, each identified by a letter:

  1. Application (A): Development environment
  2. Integration (I): Integration testing environment
  3. Simulation (S): Pre-production environment
  4. Production (X): Live environment

This separation allows jobs to be developed and tested before promotion to production.

Basic Dollar Universe Concepts:

  • Uprocs (Universal Processes): Basic job definitions
  • Sessions: Groups of related Uprocs
  • Management Units (MU): Logical grouping of resources
  • Nodes: Physical or virtual machines where jobs run
  • Launchers: Components that execute jobs
  • Business Views: Custom dashboards for monitoring

[Figure: Dollar Universe architecture]

Practical Task Scheduling Examples

Below are practical examples of common tasks that are typically automated through scheduling systems, along with the commands or scripts used to execute them.

1. Database Backup

Windows Task (SQL Server):

sqlcmd -S serverName -E -Q "BACKUP DATABASE [CustomerDB] TO DISK='D:\Backups\CustomerDB_full_%date:~-4,4%%date:~-10,2%%date:~-7,2%.bak'"

Linux Cron Job (MySQL):

mysqldump --user=username --password=password database_name > /backup/mysql/database_name_$(date +\%Y\%m\%d).sql

Dollar Universe Task:

# Define variables (SERVER is a placeholder for the SQL Server host)
SERVER="serverName"
DATABASE="CustomerDB"
BACKUP_DIR="/backup/databases"
# Braces are required: "$DATABASE_" would be read as a variable named DATABASE_
BACKUP_FILE="${DATABASE}_$(date +%Y%m%d).bak"

# Execute backup command (-b makes sqlcmd return a non-zero exit code on SQL errors)
sqlcmd -S "$SERVER" -E -b -Q "BACKUP DATABASE [$DATABASE] TO DISK='$BACKUP_DIR/$BACKUP_FILE'"

# Verify success and notify
if [ $? -eq 0 ]; then
    echo "Backup successful" | mail -s "Database Backup Success" admin@company.com
else
    echo "Backup failed" | mail -s "URGENT: Database Backup Failure" admin@company.com
fi

2. Log File Cleanup

Windows Task:

forfiles /p "C:\logs" /s /m *.log /d -30 /c "cmd /c del @path"

Linux Cron Job:

find /var/log -name "*.log" -type f -mtime +30 -exec rm {} \;

Dollar Universe Task:

# Define retention period and log directory
RETENTION_DAYS=30
LOG_DIR="/var/log/application"

# Remove logs (plain or compressed) older than the retention period
find "$LOG_DIR" \( -name "*.log" -o -name "*.log.gz" \) -type f -mtime +"$RETENTION_DAYS" -exec rm {} \;

# Compress remaining logs older than 7 days (the cleanup history file is skipped)
find "$LOG_DIR" -name "*.log" ! -name "cleanup_history.log" -type f -mtime +7 -exec gzip {} \;

# Log action
echo "Log cleanup completed on $(date)" >> "$LOG_DIR/cleanup_history.log"

3. System Health Check

Windows Task:

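@echo off
REM wmic is deprecated in current Windows releases and may not be installed; PowerShell's Get-CimInstance is the usual replacement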
echo System Health Check: %date% %time% > C:\reports\health_check.log
echo. >> C:\reports\health_check.log
echo Disk Space: >> C:\reports\health_check.log
wmic logicaldisk get deviceid,freespace,size >> C:\reports\health_check.log
echo. >> C:\reports\health_check.log
echo Memory Usage: >> C:\reports\health_check.log
wmic OS get FreePhysicalMemory,TotalVisibleMemorySize >> C:\reports\health_check.log

Linux Cron Job:

#!/bin/bash
REPORT_FILE="/var/log/health_check.log"
echo "System Health Check: $(date)" > $REPORT_FILE
echo "Disk Usage:" >> $REPORT_FILE
df -h >> $REPORT_FILE
echo -e "\nMemory Usage:" >> $REPORT_FILE
free -m >> $REPORT_FILE
echo -e "\nLoad Average:" >> $REPORT_FILE
uptime >> $REPORT_FILE

4. File Synchronization

Windows Task:

robocopy "E:\SourceFiles" "\\server\backup\files" /MIR /R:3 /W:10 /LOG:C:\logs\sync_log.txt

Linux Cron Job:

rsync -avz --delete /source/directory/ user@remote:/destination/directory/ >> /var/log/rsync.log 2>&1

5. Application Service Monitor

Windows Task:

@echo off
sc query "Important Service" | find "RUNNING" > nul
if errorlevel 1 (
    net start "Important Service"
    echo Service restarted at %date% %time% >> C:\logs\service_restarts.log
)

Linux Cron Job:

#!/bin/bash
# systemctl is-active checks the unit's state directly instead of matching a process name
if ! systemctl is-active --quiet service_name; then
    systemctl start service_name
    echo "Service restarted at $(date)" >> /var/log/service_monitor.log
fi

Dollar Universe Complex Job Example:

# Define a complex ETL process with dependencies and notifications
# Note: $? holds the exit status of the last command; $(...) would capture its output instead

# Step 1: Extract data from source system
./extract_data.sh
EXTRACT_STATUS=$?
if [ "$EXTRACT_STATUS" -ne 0 ]; then
    echo "Extract failed with status $EXTRACT_STATUS" | mail -s "ETL Process Failure" team@company.com
    exit "$EXTRACT_STATUS"
fi

# Step 2: Transform data
./transform_data.sh
TRANSFORM_STATUS=$?
if [ "$TRANSFORM_STATUS" -ne 0 ]; then
    echo "Transform failed with status $TRANSFORM_STATUS" | mail -s "ETL Process Failure" team@company.com
    exit "$TRANSFORM_STATUS"
fi

# Step 3: Load data into data warehouse
./load_data.sh
LOAD_STATUS=$?
if [ "$LOAD_STATUS" -ne 0 ]; then
    echo "Load failed with status $LOAD_STATUS" | mail -s "ETL Process Failure" team@company.com
    exit "$LOAD_STATUS"
fi

# Step 4: Generate reports
./generate_reports.sh
REPORT_STATUS=$?
if [ "$REPORT_STATUS" -ne 0 ]; then
    echo "Report generation failed with status $REPORT_STATUS" | mail -s "ETL Process Failure" team@company.com
    exit "$REPORT_STATUS"
fi

# All steps completed successfully
echo "ETL process completed successfully at $(date)" | mail -s "ETL Process Success" team@company.com
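
The Linux cron commands in this section still need a schedule before cron will run them. A crontab line consists of five time fields (minute, hour, day of month, month, day of week) followed by the command. The sketch below prints an example fragment for review; the script paths and the chosen times are hypothetical, and the `*/15` step syntax assumes a common cron implementation such as the one shipped with most Linux distributions.

```shell
#!/bin/bash
# Prints an example crontab fragment for review.
# Script paths and schedule times are hypothetical.
example_crontab() {
    cat <<'EOF'
# minute hour day-of-month month day-of-week command
0 2 * * * /usr/local/bin/db_backup.sh
30 3 * * 0 /usr/local/bin/log_cleanup.sh
*/15 * * * * /usr/local/bin/health_check.sh
*/5 * * * * /usr/local/bin/service_monitor.sh
EOF
}

example_crontab
```

Read top to bottom, the entries run a backup daily at 02:00, a log cleanup on Sundays at 03:30, a health check every 15 minutes, and a service check every 5 minutes. Lines like these are typically added to a user's crontab with `crontab -e`.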

Knowledge Check: Task Scheduler

  1. What is the primary benefit of using task scheduling in IT operations?
    a) Increasing server security
    b) Automating repetitive tasks to improve efficiency
    c) Reducing hardware requirements
    d) Simplifying user interface

  2. In Windows Task Scheduler, which of the following is NOT a standard trigger type?
    a) At startup
    b) On idle
    c) On network connection
    d) At a specific time

  3. What command would you use to schedule a task that deletes log files older than 30 days in Windows?
    a) del C:\logs\*.log /days:30
    b) forfiles /p "C:\logs" /s /m *.log /d -30 /c "cmd /c del @path"
    c) remove-item -path C:\logs -older 30
    d) erase C:\logs\*.log -age 30

  4. What is a "Uproc" in Dollar Universe?
    a) A universal processor
    b) An upgrade procedure
    c) A basic job definition (Universal Process)
    d) An error logging utility

  5. When scheduling backup tasks, what is considered a best practice?
    a) Always run backups during peak business hours
    b) Include verification steps and error notification
    c) Use the same backup destination for all backups
    d) Run backups as a standard user account

Server Monitoring Basics

Monitoring Fundamentals

Server monitoring is the process of continuously tracking and analyzing server performance, availability, and health to ensure optimal operation and quickly identify issues before they impact users.

Goals of Server Monitoring:

  1. Proactive Issue Detection: Identify problems before they affect users
  2. Performance Optimization: Track resources to maintain optimal performance
  3. Capacity Planning: Collect data for future resource needs
  4. Security: Detect unusual activity that may indicate breaches
  5. Compliance: Meet regulatory and policy requirements
  6. Availability Tracking: Ensure systems remain operational and accessible

Types of Monitoring:

  1. Infrastructure Monitoring:

    • Hardware health
    • Operating system performance
    • Network connectivity
    • Storage utilization
  2. Application Monitoring:

    • Application performance
    • Error rates
    • Response times
    • User experience
  3. Service Monitoring:

    • Service availability
    • Business process completion
    • End-to-end transaction flows
    • Service level agreement compliance

Monitoring Methods:

  1. Agent-Based: Software installed on servers collects and reports data

    • Pros: Detailed information, works behind firewalls
    • Cons: Requires installation, consumes resources
  2. Agentless: External systems check server status without local software

    • Pros: No installation needed, minimal overhead
    • Cons: Less detailed information, network access required
  3. Synthetic Monitoring: Simulating user actions to test functionality

    • Pros: Tests user experience, proactive
    • Cons: May not catch all real-world issues
  4. Real User Monitoring: Tracking actual user interactions

    • Pros: Shows actual user experience issues
    • Cons: Reactive, privacy considerations

[Figure: Server monitoring concept]

Key Monitoring Metrics

Understanding critical server metrics is essential for effective monitoring. Here are the key metrics operations teams typically track:

1. CPU Metrics

  • CPU Utilization: Percentage of processor time being used
  • CPU Load: Average number of processes running or waiting for CPU time (the load average on Linux/Unix)
  • CPU Queue Length: Tasks waiting for processor time
  • Context Switches: How often CPU switches between tasks

Normal vs. Problematic Values:

  • Normal: Peaks of up to 70-80% utilization, with lower levels most of the time
  • Problematic: Sustained periods above 90%, consistent high load averages
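
The word "sustained" matters here: a single sample above 90% is a spike, while several in a row suggest a problem. A minimal sketch of that distinction follows, assuming utilization samples are whole-number percentages collected at regular intervals; the function name `sustained_above` and the choice of three consecutive samples are illustrative assumptions.

```shell
#!/bin/bash
# Hypothetical sketch: returns 0 if at least $2 consecutive samples exceed threshold $1.
# Remaining arguments are integer utilization samples, oldest first.
sustained_above() {
    local threshold="$1" required="$2" streak=0 sample
    shift 2
    for sample in "$@"; do
        if [ "$sample" -gt "$threshold" ]; then
            streak=$((streak + 1))
            if [ "$streak" -ge "$required" ]; then
                return 0
            fi
        else
            streak=0    # a sample at or below the threshold resets the run
        fi
    done
    return 1
}

# Three samples in a row above 90% trigger the alert; an isolated spike would not.
if sustained_above 90 3 72 95 97 93 60; then
    echo "CPU alert: sustained utilization above 90%"
fi
```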

2. Memory Metrics

  • Memory Utilization: Percentage of RAM in use
  • Available Memory: Amount of free RAM
  • Page File Usage: Virtual memory utilization
  • Swap Activity: Frequency of moving data between RAM and disk

Normal vs. Problematic Values:

  • Normal: Up to 70-80% memory use with adequate available memory
  • Problematic: Consistently above 90%, high swap activity

3. Disk Metrics

  • Disk Space: Free vs. used storage
  • I/O Throughput: Data read/write speed
  • IOPS: Input/Output operations per second
  • Disk Queue Length: Commands waiting for disk processing
  • Latency: Time taken to complete disk operations

Normal vs. Problematic Values:

  • Normal: Less than 80% disk capacity, low queue lengths
  • Problematic: Above 90% capacity, high latency, sustained queue length
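
A capacity check along these lines can be sketched with `df`. The function names are hypothetical, and the "warning" label for the 80-90% band is an assumption, since the text above only names the normal and problematic ranges.

```shell
#!/bin/bash
# Hypothetical sketch: classify disk usage against the thresholds described above.

# Prints the used-capacity percentage (digits only) of the filesystem holding path $1.
# df -P forces the portable one-line-per-filesystem output format.
disk_used_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Prints normal (<80), warning (80-90, an assumed label), or problematic (>90).
classify_disk_usage() {
    case "$1" in
        '' | *[!0-9]*) echo "unknown"; return 1 ;;   # not a whole number
    esac
    if [ "$1" -gt 90 ]; then
        echo "problematic"
    elif [ "$1" -ge 80 ]; then
        echo "warning"
    else
        echo "normal"
    fi
}

usage=$(disk_used_percent /)
echo "/ is ${usage}% full: $(classify_disk_usage "$usage")"
```

Capacity is only one of the disk metrics listed; queue length and latency need other tools, such as `iostat` where it is installed.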

4. Network Metrics

  • Bandwidth Utilization: Amount of available bandwidth being used
  • Packet Loss: Percentage of packets that fail to reach destination
  • Latency: Time for packets to reach destination
  • Connection Count: Number of active network connections

Normal vs. Problematic Values:

  • Normal: Bandwidth peaks below 70%, minimal packet loss
  • Problematic: Sustained high bandwidth utilization, packet loss above 1%

5. Service Metrics

  • Availability: Percentage of time service is operational
  • Response Time: How quickly service responds to requests
  • Error Rate: Percentage of transactions that fail
  • Throughput: Number of transactions processed

Normal vs. Problematic Values:

  • Normal: 99.9%+ availability, consistent response times
  • Problematic: Increasing error rates, growing response times
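
An availability percentage is easier to reason about as a downtime budget. The sketch below does the arithmetic with `awk`; the function names are hypothetical, and a 30-day month (43,200 minutes) is assumed for the examples.

```shell
#!/bin/bash
# Hypothetical helpers for availability arithmetic.

# Allowed downtime in minutes for a target availability ($1, in percent)
# over a period of $2 days.
downtime_budget_minutes() {
    awk -v target="$1" -v days="$2" \
        'BEGIN { printf "%.1f\n", (100 - target) / 100 * days * 24 * 60 }'
}

# Measured availability in percent from total minutes in the period ($1)
# and minutes of downtime ($2).
availability_percent() {
    awk -v total="$1" -v down="$2" \
        'BEGIN { printf "%.3f\n", (total - down) / total * 100 }'
}

downtime_budget_minutes 99.9 30    # prints: 43.2
availability_percent 43200 90      # prints: 99.792
```

In other words, 99.9% availability leaves about 43 minutes of downtime in a 30-day month, and a single 90-minute outage in that month already drops availability below the target.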

[Figure: Server metrics dashboard]

Common Monitoring Tools

Operations teams use various tools to monitor server health and performance. Here's an overview of common monitoring solutions:

1. Infrastructure Monitoring Tools

  • Nagios: Open-source monitoring system

    • Checks services, hosts, and network devices
    • Alerts based on thresholds
    • Extensive plugin ecosystem
  • Zabbix: Enterprise-class monitoring solution

    • Monitors servers, networks, and applications
    • Auto-discovery of devices
    • Visualization and reporting
  • PRTG: Network monitoring suite

    • Sensor-based monitoring system
    • Easy-to-use interface
    • Comprehensive device support
  • SolarWinds: Enterprise monitoring platform

    • Network Performance Monitor
    • Server & Application Monitor
    • Storage Resource Monitor

2. Application Performance Monitoring (APM) Tools

  • New Relic: Cloud-based APM solution

    • Real-time performance analytics
    • Code-level diagnostics
    • User experience monitoring
  • Dynatrace: AI-powered monitoring

    • Automatic discovery and mapping
    • Root cause analysis
    • Real user monitoring
  • AppDynamics: Application performance management

    • Business transaction monitoring
    • End-to-end transaction tracing
    • Root cause diagnosis

3. Log Management Tools

  • Splunk: Data aggregation and analysis platform

    • Collects and indexes machine data
    • Advanced search capabilities
    • Visualization and alerting
  • ELK Stack (Elasticsearch, Logstash, Kibana):

    • Open-source log analysis platform
    • Powerful search capabilities
    • Flexible visualization
  • Graylog: Log management and analysis

    • Centralized log collection
    • Structured data parsing
    • Alert mechanisms

4. Cloud Monitoring Services

  • Amazon CloudWatch: Amazon Web Services monitoring

    • Metrics collection for AWS resources
    • Logs and events monitoring
    • Custom dashboards and alarms
  • Azure Monitor: Microsoft Azure monitoring

    • Infrastructure and application monitoring
    • Log Analytics
    • Application Insights
  • Google Cloud Monitoring: Google Cloud Platform monitoring

    • Metrics, uptime, and health checks
    • Dashboards and alerting
    • Log management

[Figure: Basic monitoring dashboard elements]

Incident Response

When monitoring detects an issue, operations teams follow structured incident response processes to minimize impact and restore service quickly.

Incident Response Workflow:

  1. Detection

    • Alert triggered by monitoring system
    • User-reported issue
    • Proactive discovery during routine checks
  2. Triage

    • Assess severity and impact
    • Determine priority
    • Assign appropriate resources
  3. Investigation

    • Gather information about the incident
    • Analyze logs and monitoring data
    • Identify potential causes
  4. Containment

    • Prevent incident from spreading or worsening
    • Implement temporary workarounds if possible
    • Protect critical systems and data
  5. Resolution

    • Implement fix to resolve the root cause
    • Test to verify issue is resolved
    • Document solution
  6. Recovery

    • Restore normal operation
    • Validate system functionality
    • Ensure data integrity
  7. Post-Incident Review

    • Analyze what happened and why
    • Document lessons learned
    • Implement preventative measures

Common Server Incidents and Responses:

| Incident Type | Potential Causes | Typical Response Actions |
| --- | --- | --- |
| High CPU Usage | Runaway process, inadequate resources, malware | Identify resource-intensive processes, restart services, add resources |
| Memory Depletion | Memory leaks, inadequate allocation, excessive caching | Restart affected services, increase memory, optimize applications |
| Disk Space Issues | Log files, temporary files, data growth | Clean up logs, add storage, implement rotation policies |
| Service Outage | Failed updates, configuration errors, hardware failures | Restart services, roll back changes, failover to backup systems |
| Network Connectivity | Hardware failure, configuration issues, bandwidth saturation | Check physical connections, verify network settings, manage traffic |
| Security Breach | Unauthorized access, malware, configuration vulnerabilities | Isolate affected systems, block malicious traffic, patch vulnerabilities |

Alert Severity Levels:

Most monitoring systems categorize alerts by severity levels:

  1. Critical/P1 (Highest)

    • Production service down
    • Significant business impact
    • Immediate response required
    • Examples: Website down, database unavailable
  2. High/P2

    • Partial service degradation
    • High business impact
    • Rapid response required
    • Examples: Slow response times, failed backups
  3. Medium/P3

    • Minor service impact
    • Limited business effect
    • Response during business hours
    • Examples: Non-critical service issues, warnings
  4. Low/P4 (Lowest)

    • No immediate impact
    • Informational in nature
    • Scheduled resolution
    • Examples: Capacity planning alerts, non-urgent patches

[Figure: Incident response process]

Knowledge Check: Server Monitoring

  1. What is the primary goal of server monitoring?
    a) To document server specifications
    b) To proactively detect and address issues before they impact users
    c) To justify hardware upgrades
    d) To track employee productivity

  2. Which of the following would be considered a normal CPU utilization pattern?
    a) Consistently at 99-100% utilization
    b) Periodic spikes to 70-80% with normal levels around 20-30%
    c) Consistently at 0-5% utilization
    d) Rapid oscillation between 0% and 100% every few seconds

  3. Which monitoring approach requires installing software on the target servers?
    a) Agent-based monitoring
    b) Synthetic monitoring
    c) Ping-based monitoring
    d) SNMP monitoring

  4. What metric is most useful for determining if a server is experiencing memory pressure?
    a) CPU temperature
    b) Network packet loss
    c) Swap/page file activity
    d) Disk fragmentation level

  5. During an incident response, what should be done immediately after detecting a critical service outage?
    a) Schedule a team meeting to discuss the issue
    b) Document the problem for future reference
    c) Assess the severity and impact to determine appropriate response
    d) Immediately restart all server services

Conclusion

This bonus day has covered three essential areas of IT operations that complement the technical skills explored in previous days. Understanding ticketing systems provides the framework for organized support delivery, while task scheduling enables automation of routine activities, freeing up time for more complex issues. Server monitoring ensures systems run optimally and problems are caught early.

These skills form the foundation of efficient IT operations and are valuable for any technical support specialist. As you continue your IT career journey, you'll likely use these concepts daily, regardless of your specific role or industry. The ability to properly track issues, automate routine tasks, and monitor system health is a hallmark of a skilled IT professional.

Remember that the tools and specific implementations may vary between organizations, but the core concepts remain consistent. Focus on understanding the principles behind these practices, and you'll be able to adapt to any environment's specific requirements.