The new global study, in partnership with The Upwork Research Institute, interviewed 2,500 global C-suite executives, full-time employees and freelancers. Results show that the optimistic expectations about AI’s impact are not aligning with the reality faced by many employees. The study identifies a disconnect between the high expectations of managers and the actual experiences of employees using AI.
Despite 96% of C-suite executives expecting AI to boost productivity, the study reveals that 77% of employees using AI say it has added to their workload and created challenges in achieving the expected productivity gains. Not only is AI increasing the workloads of full-time employees, it’s hampering productivity and contributing to employee burnout.
They’ve got a guy at work whose job title is basically AI Evangelist. This is terrifying in that it’s a financial tech firm handling twelve figures a year of business, the last place where people will put up with “plausible bullshit” in their products.
I grudgingly installed the Copilot plugin, but I’m not sure what it can do for me better than a snippet library.
I asked it to generate a test suite for a function, as a rudimentary exercise. It was able to identify “yes, there are n return values, so write n test cases” and “You’re going to actually have to CALL the function under test”, but was unable to figure out how to build the object being fed in to trigger any of those cases; to do so would require grokking much of the code base. I didn’t need to burn half a barrel of oil for that.
I’d be hesitant to trust it with “summarize this obtuse spec document” when half the time said documents are self-contradictory or downright wrong. Again, plausible bullshit isn’t suitable.
Maybe the problem is that I’m too close to the specific problem. AI tooling might be better for open-ended or free-association “why not try glue on pizza” type discussions, but when you already know “send exactly 4-7-Q-unicorn emoji in this field or the transaction is converted from USD to KPW” having to coax the machine to come to that conclusion 100% of the time is harder than just doing it yourself.
I can see the marketing and sales people love it, maybe customer service too, click one button and take one coherent “here’s why it’s broken” sentence and turn it into 500 words of flowery says-nothing prose, but I demand better from my machine overlords.
Tell me when Stable Diffusion figures out that “Carrying battleaxe” doesn’t mean “katana randomly jutting out from forearms”, maybe at that point AI will be good enough for code.
I, too, work in fintech. I agree with this analysis. That said, we currently have a large mishmash of regexes doing classification and they aren’t bulletproof. It would be useful to try something like a fine-tuned BERT model to classify transactions that passed through the regex net without getting classified. And the PoC would be just context stuffing some examples for a few-shot prompt of an LLM and a constrained grammar (just the classification, plz). Because our finance generalists basically have to do this same process, and it would be nice to augment their productivity with a hint: “The computer thinks it might be this kinda transaction”
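For what it's worth, here is a rough sketch of what that PoC flow might look like: regexes first, and only the leftovers go to a model for a hint. Everything in it is made up for illustration: the labels, the patterns, the example memos, and the `model` callable (a stand-in for whatever LLM client you'd actually use). The `constrain` step is also just an after-the-fact whitelist check, not a real constrained grammar at decode time, so treat it as a placeholder for that idea.

```python
import re
from typing import Callable, List, Optional, Tuple

# Hypothetical labels, patterns and examples; not from any real system.
LABELS = ("payroll", "refund", "wire_fee", "unknown")

REGEX_RULES = [
    (re.compile(r"\bpayroll\b", re.IGNORECASE), "payroll"),
    (re.compile(r"\brefund\b|\breversal\b", re.IGNORECASE), "refund"),
]

EXAMPLES = [
    ("OUTGOING WIRE SVC CHG", "wire_fee"),
    ("ACME CORP SALARY RUN", "payroll"),
]


def classify_by_regex(memo: str) -> Optional[str]:
    """Return the first matching rule's label, or None if nothing matches."""
    for pattern, label in REGEX_RULES:
        if pattern.search(memo):
            return label
    return None


def build_few_shot_prompt(memo: str, examples: List[Tuple[str, str]]) -> str:
    """Context-stuff a handful of labelled examples ahead of the new memo."""
    lines = ["Answer with exactly one of: " + ", ".join(LABELS)]
    for text, label in examples:
        lines.append(f"Memo: {text}\nLabel: {label}")
    lines.append(f"Memo: {memo}\nLabel:")
    return "\n\n".join(lines)


def constrain(raw: str) -> str:
    """Accept the model output only if it is exactly one allowed label."""
    cleaned = raw.strip().lower()
    return cleaned if cleaned in LABELS else "unknown"


def classify(
    memo: str,
    model: Callable[[str], str],
    examples: List[Tuple[str, str]],
) -> Tuple[str, str]:
    """Return (label, source); source says whether a regex or the model decided."""
    label = classify_by_regex(memo)
    if label is not None:
        return label, "regex"
    # Only memos that fell through the regex net reach the model.
    raw = model(build_few_shot_prompt(memo, examples))
    return constrain(raw), "model_hint"
```

Returning the source alongside the label is deliberate: anything tagged `model_hint` is only the "computer thinks it might be this" suggestion, and a human still makes the call.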
It is suitable when you’re the one producing the bullshit and you only need it accepted.
Which is what people pushing for this are. Their jobs and occupations are tolerant to just imitating, so they think that for some reason it works with airplanes, railroads, computers.
That’s why I have my doubts when people say it’s saving them a lot of time or effort. I suspect it’s planting bombs that they simply haven’t yet found. Like it generated code and the code seemed to work when they ran it, but it contains a subtle bug that will only be discovered later. And the process of tracking down that bug will completely wreck any gains they got from using the LLM in the first place.
Same with the people who are actually using it on human languages. Like, I heard a story of a government that was overwhelmed with public comments or something, so they were using an LLM to summarize those so they didn’t have to hire additional workers to read the comments and summarize them. Sure… and maybe it’s relatively close to what people are saying 95% of the time. But 5% of the time it’s going to completely miss a critical detail. So, you go from not having time to read all the public comments so not being sure what people are saying, to having an LLM give you false confidence that you know what people are saying even though the LLM screwed up its summary.