dev-resources.site
for different kinds of informations.
After Excel, What Data Analytics Tool Do We Need
In contemporary business environment, data analytics has already become an important decision-making tool. It enables businesses to better understand market trend, user behaviors and internal operations, and thereby develop well-founded strategies and become more competitive. But data analytics is not easy. You need to choose a suitable tool and method.
Many prefer to choose BI tools because Business Intelligence sounds closely related to data analytics, and the software appears very suitable for the job with its smooth interactivity and gorgeous UI most of the time. Yet BI tools have been technologically narrowed down to multidimensional analysis that mainly focuses on data aggregation and representation based on preset data cubes. Indeed, they can help get insights in certain simple scenarios, such as identifying specific time periods and departments responsible for high costs in an enterprise. However, BI tools get awkward to use when handling complex business computations, such as verifying stock analysts’ guess that stocks that meet certain conditions are apt to rise, helping a sales director find which type of salespeople would be more effective in handling difficult customers, and ecommerce data analysts’ judgment that rewarding customers having certain characteristics at a lower cost can yield higher-than-average revenue.
These tasks all involve multistep interactive computations, which are beyond traditional BI tools ability.
Here is an example. Stock analysts might make a bold guess based on their experience that stocks meeting certain criteria are more likely to rise in the future. In order to verify the guess, they first analyze historical data to find stocks meeting the criteria, and then examine the trends of stocks to see how many actually rise. If most stocks rise, the guess is verified. Otherwise, the guess is wrong. And then they adjust the guess, change the conditions, and continue to verify the guess using the historical data. After multiple iterations of preset conditions, there is a high probability of rising and high growth rates among the finally selected stocks, from which they draw a regular pattern that will guide future stock trading. This is the whole process of handling a data analytics task using the multistep interactive computation. In essence, it repeatedly makes guesses and conducts verifications based on historical data, and ultimately identifies a certain pattern.
Experienced business people make guesses, and the data analytics tool helps them verify the guesses, which means performing certain computations on the historical data. The characteristic of this type of computations is that the process cannot be designed in advance. The analyst decides the next action based on the result of the current step, meaning that they conduct the computation freely. This is the interactive computation, which is similar to the process of using a calculator. Unlike ordinary numerical calculators, data to be computed during the analytics isn’t simple numbers but a batch of tables, which is “structured data”. This capability of data analytics can be aptly described as a tabular data calculator.
Excel is such a tabular data calculator. As a widely used data analytics tool, Excel has good interactivity, and is suitable for processing tabular data. The Excel PivotTable even includes BI’s multidimensional analysis, though with a less attractive interface and less smooth operation. Actually, Excel is the most frequently used data analytics software.
But troubles arose after Excel had been used for data analysis for a long time – certain tasks are difficult to implement with Excel.
Data is too big to handle. Excel has capacity limits. It cannot run smoothly as long as data volume gets larger. It becomes noticeably slow even when dealing with a few hundred thousand rows, and just fails with over a million rows.
Formulas are too difficult to write. Excel is not good at handling set operations and post-grouping computations. For example, in the case of “to find stocks that are more likely to rise” mentioned earlier. One step of the guess-verification procedure might be to find periods of stock price increases lasting more than five days and see if there is a pattern.
This task involves order-based grouping operation and keeping grouped subsets for filtering, which are complex and difficult for Excel.
How about using database? Databases can handle slightly larger data volumes without issue, but the setup can be complex. Excel, on the other hand, requires tedious data import before analysis. Moreover, SQL is often much more difficult than Excel. For example, to compute the longest consecutive rising days for each stock, SQL has the following code:
SELECT CODE, MAX(con_rise) AS longest_up_days
FROM (
SELECT CODE, COUNT(*) AS con_rise
FROM (
SELECT CODE, DT, SUM(updown_flag) OVER (PARTITION BY CODE ORDER BY CODE, DT) AS no_up_days
FROM (
SELECT CODE, DT,
CASE WHEN CL > LAG(CL) OVER (PARTITION BY CODE ORDER BY CODE, DT) THEN 0
ELSE 1 END AS updown_flag
FROM stock
)
)
GROUP BY CODE, no_up_days
)
GROUP BY CODE
The nested SQL code can be a significant challenge even for professional programmers.
But it is quite simple to handle it with Excel. First, sort rows by code and date; second, write a formula to compute the number of consecutive rising days; third, group rows and summarize dates to find the longest consecutive rising days; fourth, collapse the spreadsheet to display only the desired result. This intuitive four-step process is much simpler than SQL.
BI can’t be relied upon either. BI themselves are not difficult to use, but they still rely on databases to handle big data and face similar setup challenges. Moreover, as mentioned before, since BI has been narrowed down to multidimensional analysis, its computational capability is reduced too much to catch up with SQL. It can’t even handle the above task of getting consecutive rising days for each stock, which is quite a restriction for Excel experts.
Then it seems that programming is the only way to go, as there’s nothing that programming can’t solve.
However, most programming languages have poor interactivity. They are far from the concept of tabular data calculator, and are not suitable for data analytics. VBA enhances Excel computing ability, but it has too weak support for structured data processing, which results in tedious coding process.
In addition, VBA does not have any built-in big data processing capabilities. We need to implement one in it, and the process is more complicated.
What about Python? The language is widely endorsed by training courses everywhere. Python’s tabular data computation ability is decent, but it is still a bit complex to write. For example, to compute the longest consecutive rising days for each stock, it has to turn to hard coding.
import pandas as pd
stock_file = "StockRecords.txt"
stock_info = pd.read_csv(stock_file,sep="\t")
stock_info.sort_values(by=['CODE','DT'],inplace=True)
stock_group = stock_info.groupby(by='CODE')
stock_info['label'] = stock_info.groupby('CODE')['CL'].diff().fillna(0).le(0).astype(int).cumsum()
max_increase_days = {}
for code, group in stock_info.groupby('CODE'):
max_increase_days[code] = group.groupby('label').size().max() – 1
max_rise_df = pd.DataFrame(list(max_increase_days.items()), columns=['CODE', 'max_increase_days'])
Besides, Python also lacks direct support for big data processing and also requires hard coding, which is far beyond the capabilities of Excel users.
And these programming languages do not have good interactivity. Results can only be viewed after the code is written and executed, and the whole code needs to be executed again when a local error is corrected. This is far from being as user-friendly as WYSIWYG Excel.
As these popular programming languages are not a good choice, is there any other option?
Here is esProc Desktop, a post-Excel, powerful data analytics tool.
esProc Desktop is a tabular data calculator. It provides SPL, a programming language specialized in handling structured data. SPL IDE boasts strong interactivity; the language is convenient and efficient in processing big data and helps enhance Excel’s computing ability. These significantly reduce the difficulty in handling complex computations and make it an ideal tool for data analysts.
WYSIWYG interface, strong interactivity
SPL programming is like writing Excel formulas, even business staff can easily master it. Cell names are natural variable names, which can intuitively reference a previous computing result. SPL supports executing one step at a time. Users can view result of each step at any time on the right panel; the structured data set is automatically displayed in a tabular format. This makes SPL very suitable for handling multistep, interactive computations.
Take the stock rising periods task mentioned earlier as an example. The SPL code can be divided into five steps – reading files, sorting, order-based grouping, filtering, and merging. The interactive process is clearly displayed in the grid.
Each step can be executed separately. If there is an error, we only need to modify the local code, re-execute the corresponding line of code, and check the result.
Powerful gridline form of computing capability
SPL provides powerful gridline form of computing capability to write much simpler code than Python. Not only does it have the simplicity of Excel, but it also handles complex tasks that Excel struggles with, truly achieving a “both-and” solution. For example, calculating the longest consecutive rising days for each stock can be done in just three simple lines.
A more complex task: Find the number of users who are active for three consecutive days in every seven days. The Python code is long-winded:
df = pd.read_csv("../login_data.csv")
df["ts"] = pd.to_datetime(df["ts"]).dt.date
grouped = df.groupby("userid")
aligned_dates = pd.date_range(start=df["ts"].min(), end=df["ts"].max(), freq='D')
user_date_wether_con3days = []
for uid, group in grouped:
group = group.drop_duplicates('ts')
aligned_group = group.set_index("ts").reindex(aligned_dates)
consecutive_logins = aligned_group.rolling(window=7)
n = 0
date_wether_con3days = []
for r in consecutive_logins:
n += 1
if n<7:
continue
else:
ds = r['userid'].isna().cumsum()
cont_login_times = r.groupby(ds).userid.count().max()
wether_cont3days = 1 if cont_login_times>=3 else 0
date_wether_con3days.append(wether_cont3days)
user_date_wether_con3days.append(date_wether_con3days)
arr = np.array(user_date_wether_con3days)
day7_cont3num = np.sum(arr,axis=0)
result = pd.DataFrame({'dt':aligned_dates[6:],'cont3_num':day7_cont3num})
SQL is even more lengthy:
WITH all_dates AS (
SELECT DISTINCT TRUNC(ts) AS login_date
FROM login_data),
user_login_counts AS (
SELECT userid, TRUNC(ts) AS login_date,
(CASE WHEN COUNT(*)>=1 THEN 1 ELSE 0 END) AS login_count
FROM login_data
GROUP BY userid, TRUNC(ts)),
whether_login AS (
SELECT u.userid, ad.login_date, NVL(ulc.login_count, 0) AS login_count
FROM all_dates ad
CROSS JOIN (
SELECT DISTINCT userid
FROM login_data) u
LEFT JOIN user_login_counts ulc
ON u.userid = ulc.userid
AND ad.login_date = ulc.login_date
ORDER BY u.userid, ad.login_date),
whether_login_rn AS (
SELECT userid,login_date,login_count,ROWNUM AS rn
FROM whether_login),
whether_eq AS(
SELECT userid,login_date,login_count,rn,
(CASE WHEN LAG(login_count,1) OVER (ORDER BY rn)= login_count
AND login_count =1 AND LAG(userid,1) OVER (ORDER BY rn)=userid
THEN 0
ELSE 1
END) AS wether_e
FROM whether_login_rn),
numbered_sequence AS (
SELECT userid,login_date,login_count,rn, wether_e,
SUM(wether_e) OVER (ORDER BY rn) AS lab
FROM whether_eq),
consecutive_logins_num AS (
SELECT userid,login_date,login_count,rn, wether_e,lab,
(SELECT (CASE WHEN max(COUNT(*))<3 THEN 0 ELSE 1 END)
FROM numbered_sequence b
WHERE b.rn BETWEEN a.rn - 6 AND a.rn
AND b.userid=a.userid
GROUP BY b. lab) AS cnt
FROM numbered_sequence a)
SELECT login_date,SUM(cnt) AS cont3_num
FROM consecutive_logins_num
WHERE login_date>=(SELECT MIN(login_date) FROM all_dates)+6
GROUP BY login_date
ORDER BY login_date;
SPL code is much shorter and easier to understand, even ordinary business staff can master easily:
Natural big data support
SPL offers native big data processing ability, which supports using cursor to process external memory computations involving data that cannot fit into the memory. It uses almost the same data for performing external memory computations as for in-memory computations, without adding extra workloads. For example, to find the largest number of consecutive rising dates for each stock, we just need to change the import()function to cursor() function.
It also supports multithreaded parallel computing, making most use of the multicore CPU. For example, to find the periods during which a stock has been rising consecutively for more than 5 days, we just add the @m option and the performance soars.
XLL add-in embedded in Excel
esProc Desktop also provide XLL add-in, allowing users to directly write SPL code in the familiar Excel environment by leveraging SPL’s powerful computing ability SPL and the advantages of Excel. To get the consecutively rising periods, just one line of SPL code in Excel is enough.
Another example: Finding the star products that make it to top 10 every month. Performing intersection in Excel is troublesome. SPL’s set operations are much more powerful, and a short formula can get it done directly.
esProc Desktop is now offered for free. It also comes with rich resources, including books, courses, and Excel routines. Find details HERE.
Featured ones: