Testing
This is an area to collect information, techniques, and other resources that are invaluable to understanding and running tests in the best way we know today.
Free Resources
Statistical Calculators
- Sample Size Calculator by Search Discovery (https://www.searchdiscovery.com/sample-size-calculator/)
- Bayesian Test Analysis by AB Test Guide (https://abtestguide.com/bayesian/)
- Chi-Squared Goodness of Fit (http://www.socscistatistics.com/tests/goodnessoffit/Default2.aspx) (useful for checking sample ratio mismatch)
- Simple Sequential Testing sample/threshold Calculator by Evan Miller (http://www.evanmiller.org/ab-testing/sequential.html)
Website QA Utilities
- GeoPeeker (https://geopeeker.com/): Lets you view screenshots of any public URL from IP locations around the globe
- Accessibility Checker (http://wave.webaim.org/): Lets you enter any public URL and receive calculated feedback on site accessibility for visually impaired users, etc.
- Web page speed test by webpagetest.org (http://www.webpagetest.org/)
- Google PageSpeed Insights (https://developers.google.com/speed/pagespeed/insights/)
A/B Testing Platforms List
Reviews of some mainstream A/B testing tools are available from ConversionXL (https://conversionxl.com/blog/ab-testing-tools/).
- Optimizely (https://www.optimizely.com/)
- Adobe Target (https://www.adobe.com/marketing-cloud/target.html/)
- Oracle Maxymiser (https://www.oracle.com/marketingcloud/products/testing-and-optimization/index.html/)
- Monetate (https://www.monetate.com/)
- SiteSpect (https://www.sitespect.com/)
- Google Optimize (https://marketingplatform.google.com/about/optimize/)
- AB Tasty (https://www.abtasty.com/)
- Visual Website Optimizer (VWO) (https://vwo.com/)
- Conductrics (https://conductrics.com/)
Server Side Testing Discussion (From Tim Stewart and Cory Underwood)
What is server-side testing?
Almost everyone seems to be offering "server-side" (SS) testing options these days, and it's hawked as the latest must-have for any testing solution. But when you start pushing vendors on the details of what their offerings do and how, you turn up some interesting variations that are often hard to compare to other offerings.
Most people using server-side offerings today are using the vendor to route traffic to internally built variants or components. You'll find a lot of the tools are a combined decision engine and segmentation engine but, unlike most classic A/B testing tools, do not also host and deliver the variant code.
How it differs from client-side and release testing
A lot of SS testing, in fact probably the largest market share, is carried out by engineering teams using an in-house tool as part of their release process. Traffic is routed via coin flip or a 10% soak to Server A or to the new experience on Server B. The objectives are to minimize errors and drag on performance; few business metrics are typically involved. Using a vendor basically just makes that routing and splitting a bit easier. Compared to most client-side testing, this is a bit like the difference between tuning an engine and swapping it out completely. A lot of SS testing is swapping the whole engine every time, for any change. But swapping the engine out when you only wanted to tune the injectors is overkill.
Most SS offerings make an API call with some state information; the API returns "use code path A" or "use code path B", and the server then does what the segmentation engine asked for. Compared to in-house tools, this is handy if you want to do some sort of advanced segmentation.
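As a minimal sketch of that pattern, assuming a hypothetical vendor endpoint and payload shape (real vendor APIs differ, and the names here are made up for illustration):

```typescript
// Minimal sketch of a server-side decision call. The endpoint, payload,
// and response shape are hypothetical; every vendor's API differs.
interface DecisionResponse {
  variant: "A" | "B"; // which code path the segmentation engine chose
}

async function renderCheckout(userId: string): Promise<string> {
  // Ship some state the segmentation engine can use for targeting.
  const res = await fetch("https://decide.example-vendor.com/v1/decide", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      userId,
      experiment: "checkout-flow",
      attributes: { country: "GB", loggedIn: true },
    }),
  });
  const { variant } = (await res.json()) as DecisionResponse;

  // The server then does what the engine asked for.
  return variant === "B" ? renderNewCheckout() : renderExistingCheckout();
}

function renderExistingCheckout(): string { return "<html><!-- path A --></html>"; }
function renderNewCheckout(): string { return "<html><!-- path B --></html>"; }
```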
If you are just doing straight traffic splits, some CDNs like Akamai allow you to set a cookie which the server can use to generate a specific experience. But because you need to read the cookie, chances are you cannot cache the base page, which has performance consequences (unless you have a complex caching structure in place, which poses its own challenges). A rough sketch of the cookie pattern follows below.

SiteSpect is a special case in SS testing because of how it works within the network architecture. It's similar to a microservice design in that it can swap out parts of the site, or the entire site. For example, you can set it to find code lines 20 to 60 and replace only the bits you want to change. More on SiteSpect below.
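Here is the cookie-based split sketched in plain Node/Express terms, standing in for what a CDN edge rule would do; the cookie name and the 10% weight are illustrative assumptions:

```typescript
import express from "express";

const app = express();

// Assign a sticky bucket on first visit, then route on the cookie.
// Because the response now depends on the cookie, the base page is
// effectively uncacheable unless the cache key includes the cookie.
app.get("/", (req, res) => {
  let bucket = req.headers.cookie?.match(/ab_bucket=([AB])/)?.[1];
  if (!bucket) {
    bucket = Math.random() < 0.1 ? "B" : "A"; // 10% soak to the new experience
    res.setHeader("Set-Cookie", `ab_bucket=${bucket}; Path=/; Max-Age=2592000`);
  }
  res.send(bucket === "B" ? pageFromServerB() : pageFromServerA());
});

function pageFromServerA(): string { return "existing experience"; }
function pageFromServerB(): string { return "new experience"; }

app.listen(8080);
```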
One disadvantage of most SS platforms is the effort and cost involved in the initial setup versus "one line of JS on the page". That's true whether you're implementing a vendor platform or adjusting an in-house solution to allow more optimization/marketing-driven testing. But once set up, not only can you run many more kinds of tests, a lot of testing also becomes much easier because it's baked into the entire call stack.
Cost considerations
Traditionally, SS testing brings with it a much higher development cost. Remember, often you're swapping out the whole engine by building out a full solution, and companies must be willing to write off that cost if it fails. Making the case for this level of investment on a project that you might end up throwing away can be difficult at companies limited by short-term shareholder interests and budgets. Of course, these might be tests that simply could not be executed with client-side JavaScript. There's also the case where an SS vendor allows you to swap out parts rather than redirecting to a fully functional alternative server. But even then, it requires a skill level and QA level that is likely higher than what you're used to.
While you may spend more up front to build the experience, it's normally easier to transition to the default production experience, because the code is already created and production-ready. In other words, you don't throw away the code only to recreate it in the native stack, as you would in client-side, JS-based scenarios, where it's more like hacking the native experience to fake the new one.
Benefits (aside from flicker effect and delivery speed)
Moving to SS is a major commitment in time and planning and, until you need it for a specific test case, that cost is large while the benefit is invisible. The reality, though, is that if you lack the capabilities introduced with SS, you're only "optimizing" what you can optimize rather than everything that could be. Your local maximum is dictated by the platform and vendor limitations, and your roadmap caps out at "we can't do that sort of testing, so we can't answer that business question".
Some people struggle to get to the point where they're pushing the limits of that local maximum, so telling them they need to engineer now for what happens once they do, lest they stall out, can be a hard sell. You have to argue that it's worth doing things a different way now so that in a year you'll be able to do other styles of testing when you need them.
In the long term, the total cost of ownership (the migration to a new approach, re-engineering the network to function with it, and the vendor cost) is more than offset as the SS approach is used to its full potential. Instead of every complex test with a client-side tool demanding a possibly prohibitive amount of advanced coding, with the cascading impact on build time and QA difficulty, these sorts of tests carry much lower time and technical challenges. They become cost-viable because the complexity is already handled by the up-front integration.
Unfortunately, as humans we're not great at long-term planning. However, without SS testing, sooner or later you'll end up pitching a more complicated test to dev/IT and get nothing but shrugs in return, because the site doesn't work that way. So some of your most valuable business-changing decisions end up untested, or possibly tested, but not at the level of segmentation you'd prefer. Or you're told that you could test it, but that it would require a £400k build by IT and a six-month wait. So you either choke your roadmap and become desperate for the test to return the result you expect (begging for p-hacking and dodgy interpretation of data), or you simply don't test. You flag testing as prohibitive, and the biggest, most dangerous changes get released untested, with huge risk.
But if you can convince the org to invest up front in an SS testing platform/framework, then you can eliminate a lot of that risk and get a lot more yeses for advanced testing. There's also the consideration of the overall roadmap and the testing program. If you make the move early, you are already on a platform that you can grow into, even if it initially feels over-engineered. Many testing programs stall when they hit the limit of their current vendor and have to re-platform, re-train, and re-tool for the next level up. In the real world this disruption can plateau your testing program for 6 to 12 months, which adds an opportunity cost. Having the overhead in place before it's needed means testing cadence can continue and the roadmap can progress without these plateaus.
Comparing vendors
Maxymiser, Target, and Optimizely all have SS solutions that have rolled out in the last two years or so. Many of these vendors seem to approach selling their SS offerings by trying to explain to the marketers how their solution fits into a complex network architecture, then expecting the marketers to turn around and explain it to their IT teams. But network architectures vary dramatically in their configuration, and most vendors seem to have just the one way in which they fit into the stack. If that doesn't fit the particular routing, load balancing, CDN infrastructure, etc., then it's going to require quite a bit of investigation. That's a big shift from "just one line of JavaScript".
Hence the lack of clarity you see: vendors would have to describe a potentially very complex set of choices to suit each client's particular config, so they often simply don't. That's where the gaps in vendor diagrams and explanations come from.
But that's also why it's tough to show a side-by-side comparison: you aren't comparing apples with apples. You're saying, "both of these are engines, but all the moving parts are different and specific connectors are needed for my car". Vendors have all made fundamental choices about how they want to set things up (with roughly the same broad idea of SS), but how it works with your network config depends on how those servers and networks are built. There is no clean comparison, because your mileage will vary.
On SiteSpect
SiteSpect learned over a decade ago that it's near impossible to demand that IT completely re-architect a site to suit a vendor platform. So they have lots of options to suit lots of site/network configurations, and they know what limitations each might place on testing and how to work around them.
To get ahead of those issues, they tend to engage with the network architecture teams early on, because they have to adapt to whatever those teams set up. Their setups all share a common backbone, but there are a million different config options to suit what the network people need in a particular case.
SiteSpect is a little unusual in that it is baked into the network layer. A user requests the site; DNS returns SiteSpect rather than the origin server; the user is routed through SiteSpect, hits the origin, and is routed back out through SiteSpect, all before anything renders in the browser. This offers a couple of interesting abilities.
On the way through SiteSpect, before the origin, you can "change" the user request: tell the origin the user asked for something different, acted differently, or belongs in a different group/segment/location, whatever you need to prompt an alternative response from existing logic. You can even add something new to the request under segmented conditions, for example "give them Server B, not A". Effectively you pull different strings to make the origin bring back a different version of the site, or just parts of it, as with any other server split test, in-house or vendor.
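This is not SiteSpect's actual configuration language, but a generic sketch of the inbound leg of that idea, written as a bare reverse proxy with made-up header and host names, might look like:

```typescript
import http from "node:http";

// Generic sketch of the inbound leg: sit between the user and the origin,
// rewrite the request under segmented conditions, then forward it on.
// An illustration of the idea, not SiteSpect's actual implementation.
const ORIGIN_HOST = "origin.internal.example"; // hypothetical origin

http.createServer((clientReq, clientRes) => {
  const headers = { ...clientReq.headers, host: ORIGIN_HOST };

  // Tell the origin the user belongs in a different group, so existing
  // server logic produces the alternative response.
  if (inTestSegment(clientReq)) {
    headers["x-experience"] = "server-b"; // made-up header the origin understands
  }

  const upstream = http.request(
    { host: ORIGIN_HOST, path: clientReq.url, method: clientReq.method, headers },
    (originRes) => {
      clientRes.writeHead(originRes.statusCode ?? 502, originRes.headers);
      originRes.pipe(clientRes);
    }
  );
  clientReq.pipe(upstream);
}).listen(8080);

function inTestSegment(req: http.IncomingMessage): boolean {
  return /ab_bucket=B/.test(req.headers.cookie ?? "");
}
```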
On the way back out from the origin, SiteSpect can find/replace any piece of the response, or even add to it, such as an analytics value (an eVar or prop, say) so experiences can be recorded differently (which test, what value, whatever), as well as make cosmetic changes. So you can request Recommendation Algo B instead of A, or Search Logic Global instead of Search Logic USA, and then dress it so it looks like the existing setup. This allows not just full new server versions (built entirely) but testing parts of what you want to release: different logic, checkout flows, validation, redirect logic, while also building how it renders within the tool.
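Again as a generic sketch rather than SiteSpect syntax, the outbound leg amounts to a find/replace over the origin's response plus an injected analytics value; the selectors and variable names below are invented for illustration:

```typescript
// Generic sketch of the outbound leg: find/replace in the origin's response
// and tag it with an analytics value before it reaches the browser.
function rewriteResponse(originBody: string, variant: "A" | "B"): string {
  let body = originBody;

  if (variant === "B") {
    // Dress the new logic so it looks like the existing setup.
    body = body.replace(
      /<div id="recs-algo-a">/,
      '<div id="recs-algo-a" data-source="algo-b">'
    );
  }

  // Record which variant the user saw (e.g., feeding an eVar or prop in
  // Adobe Analytics terms) by injecting a value into the page.
  return body.replace(
    "</head>",
    `<script>window.testVariant = ${JSON.stringify(variant)};</script></head>`
  );
}
```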
You can also cache-bust everything, cache selectively, or pull only the values with an API call or an SDK for a React or similar single-page application (SPA). SiteSpect has SDKs for all the common frameworks, so their visual editor lets you swap out code chunks from the framework; editing SPAs without rebuilding the entire app has huge advantages. This also means it's not reliant on web browsers, so it's possible to use SDKs and altered server responses to test native apps, point-of-sale machines, set-top boxes, and smart TV app content: anything where you want to test an altered payload, with the server sending the original values and the rules on what to do next.
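For flavour, here is a generic SDK-style assignment fetch in a React SPA: get the variant from the server, then swap a component without rebuilding the app. The endpoint and component names are hypothetical, and this is not SiteSpect's real SDK:

```tsx
import React, { useEffect, useState } from "react";

// Generic sketch: fetch the server-side assignment, then render the
// matching component. Endpoint and components are made up.
function Recommendations() {
  const [variant, setVariant] = useState<"A" | "B" | null>(null);

  useEffect(() => {
    fetch("/api/assignments?experiment=recs-algo") // hypothetical endpoint
      .then((r) => r.json())
      .then((data: { variant: "A" | "B" }) => setVariant(data.variant));
  }, []);

  if (variant === null) return null; // render nothing while deciding, no flicker
  return variant === "B" ? <RecsAlgoB /> : <RecsAlgoA />;
}

function RecsAlgoA() { return <div>existing recommendations</div>; }
function RecsAlgoB() { return <div>alternative recommendations</div>; }
```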
Ahem, just for the record, Conductrics offered SS runtime APIs back in 2010, along with Admin and Reporting APIs. So SiteSpect isn't alone in offering powerful network-layer editing. But at the time it seemed most people wanted Web Editors. (Web Editors must die!) That trend seems to continue today: users want all the power but don't want to deal with complex code. Advances have been made to make this more accessible, but the fact remains that if you are given the ability to start messing with fundamental functionality, API responses, and script logic, rather than just cosmetic changes, then the people doing so need at least a minimum level of ability (and QA oversight) to do it safely. Server-side removes some of the limitations client-side has in affecting these things. It does not remove the need for diligence; indeed, with the safeties off and everything unlocked, it requires more care and a robust process.
Why it’s a challenge
Companies can stall for years trying to get to a place where they can do SS testing. Between the dev staffing and refactoring the network/caching layers, turning a solution that has been used for staged release testing into something the marketing and optimization team can use (or ask the devs to run their tests with) can be a real hassle. It's normally a much harder sell than client-side testing, simply because IT is so much more involved in making it work. And frequently this most common solution, in-house server switching, lacks key features for testing: control over the randomisation and segmentation, metrics, adaptive logic (if they do this in section 1, show them version 2 of section 2), and it is fundamentally lacking when it comes to reporting statistical data. Splitting traffic is one part of testing; measuring and then analysing that traffic is critical, but often missing from an IT release-testing setup.
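One of those missing features, controlled randomisation, is worth a concrete sketch. Deterministic hash-based bucketing is the standard technique: the same user lands in the same variant on every request and every server, with no shared state. The experiment names and weights below are illustrative:

```typescript
import { createHash } from "node:crypto";

// Deterministic bucketing: hash user + experiment so each test
// randomises independently, and assignment is stable per user.
function assignVariant(
  userId: string,
  experiment: string,
  weights: { variant: string; weight: number }[]
): string {
  const digest = createHash("sha256")
    .update(`${experiment}:${userId}`)
    .digest();
  const point = digest.readUInt32BE(0) / 0xffffffff; // uniform in [0, 1]

  let cumulative = 0;
  for (const { variant, weight } of weights) {
    cumulative += weight;
    if (point < cumulative) return variant;
  }
  return weights[weights.length - 1].variant;
}

// A 90/10 soak, stable per user across servers:
const v = assignVariant("user-123", "checkout-flow", [
  { variant: "A", weight: 0.9 },
  { variant: "B", weight: 0.1 },
]);
```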
SS testing is itself often more technical, but a lot of people still say, "I want all the power to test SS, but I'd like it to be non-technical and easy for marketers to bypass IT". This is a whole other can of worms. IT is expensive and has a tendency to say no, but SS dev work can be very complex, and it's not always in the marketer's best interest to try to cut IT out of the process. At that point, hopefully you've moved past "change the colour of the button". And whilst a non-coder can certainly put a test together, doing so rather under-utilises the system's power. You're no longer limited to testing cosmetics or suppressing existing client-side JS; you can swap out libraries or entire chunks of the SPA.
It requires different levels of dev involvement as well, and having someone who understands the tradeoffs (dev-wise) is critical to doing any sort of advanced testing. The challenge then is getting your head around where in the call stack you want to make the change and what that implies for what you can test. It can be a struggle to explain why this is important; a lot of people don't realize what a shift it is to make a site entirely modular, with exchangeable parts. Skilled developers are critical here, both the developer and QA skill sets differ from general web development, and people with these skills are harder to find. The other issue you may run into with server testing is the company's caching architecture. That can be very difficult to work around, especially if heavy amounts of caching are used on the UI layer so that servers don't melt under load.
The future
It's rather surprising that we haven't seen a bigger and better push from the frameworks, the CMSes, the site builders, and the ecommerce platforms to include advanced A/B testing and segmentation natively in their platforms. Adding the API hooks and DOM pieces to make testing easier to scale and more reliable seems like a key feature. BigCommerce has a VWO integration, WooCommerce has limited support for Google Optimize, and Adobe Experience Manager has some limited integration with Adobe Target. None of these do a knock-out job, though.
Even just requiring devs to build content with unique, clearly identifiable IDs, and letting the testing tool reference those, could drastically reduce breakage. A thoughtful approach to site structure and a well-structured dataLayer that helps targeting and logic flow could be fantastic enablers for better testing (a minimal sketch follows below). The truly advanced CRO-centered organizations understand this and build flexibility into the system at every level to make testing and optimization not just possible, but easier: a core use case, a way of operating day to day. That will inevitably include, perhaps exclusively, SS testing capabilities.
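As a minimal sketch of that idea, stable IDs on testable content plus a structured dataLayer the testing tool can target; the names are illustrative conventions, not a standard:

```typescript
// Markup side: a unique, versioned ID and a stable test hook the tool can
// reference reliably, instead of a brittle CSS path that breaks on redesign:
// <section id="checkout-summary-v2" data-test-hook="checkout-summary">...</section>
const hook = document.querySelector('[data-test-hook="checkout-summary"]');

// dataLayer side: push a structured exposure event that testing and
// analytics tools can key targeting and reporting off.
const dataLayer: Record<string, unknown>[] = ((window as any).dataLayer ??= []);
dataLayer.push({
  event: "experiment.exposure",
  experimentId: "checkout-flow", // which test
  variant: "B",                  // which arm the user saw
  pageType: "checkout",          // stable hook for targeting rules
});
```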